Abstract

A problem of estimation of a Hermitian nonnegatively definite matrix $\rho$ of unit
trace (for instance, a density matrix of a quantum system) based on $n$ independent
measurements

$$Y_j = \mathrm{tr}(\rho X_j) + \xi_j, \quad j = 1, \dots, n,$$

is studied, $\{X_j\}$ being i.i.d. Hermitian matrices and $\{\xi_j\}$ being i.i.d. mean zero
random variables independent of $\{X_j\}$. The estimator

$$\hat\rho^{\varepsilon} := \operatorname{argmin}_{S \in \mathcal{S}} \Bigl[ n^{-1} \sum_{j=1}^{n} \bigl(Y_j - \mathrm{tr}(S X_j)\bigr)^2 + \varepsilon\, \mathrm{tr}(S \log S) \Bigr]$$

is considered, where $\mathcal{S}$ is the set of all nonnegatively definite Hermitian $m \times m$ matrices
of trace 1. The goal is to derive oracle inequalities showing how the estimation
error depends on the accuracy of approximation of the unknown state $\rho$ by low-rank
matrices.
1 Introduction



Let $M_m(\mathbb{C})$ be the set of all $m \times m$ matrices with complex entries and let

$$\mathcal{S} := \bigl\{ S \in M_m(\mathbb{C}) : S = S^*,\ S \ge 0,\ \mathrm{tr}(S) = 1 \bigr\}$$

be the set of all nonnegatively definite Hermitian matrices of trace 1. Here and in what
follows, $S^*$ denotes the adjoint matrix of $S$ and $\mathrm{tr}(S)$ denotes its trace. The matrices from
the set $\mathcal{S}$ can be interpreted, for instance, as density matrices describing the states of a
quantum system. Given a Hermitian matrix $X$ (an observable), its expectation in a state
$\rho \in \mathcal{S}$ is defined as $\mathbb{E}_\rho X := \mathrm{tr}(\rho X)$. Let $X_1, \dots, X_n \in M_m(\mathbb{C})$, $X_j = X_j^*$, $j = 1, \dots, n$,
be given Hermitian matrices (observables) and let $\rho \in \mathcal{S}$ be an unknown state of the
system. An important problem in quantum state tomography is to estimate $\rho$ based on
the observations $(X_j, Y_j)$, $j = 1, \dots, n$, where

$$Y_j = \mathrm{tr}(\rho X_j) + \xi_j, \quad j = 1, \dots, n,$$

$\xi_j$, $j = 1, \dots, n$, being i.i.d. random variables with mean zero and finite variance representing
measurement errors. In other words, the unknown state $\rho$ of the system is to be
learned based on a set of measurements in a number of "directions" $X_j$, $j = 1, \dots, n$ (see
Artiles, Gill and Guta (2004) for a general discussion of statistical problems in quantum
state tomography). In what follows, it will usually be assumed that the design variables
$X, X_1, \dots, X_n$ are also random; specifically, they are i.i.d. Hermitian $m \times m$ matrices
with distribution $\Pi$, and they are independent of the noise $\xi_j$, $j = 1, \dots, n$.

A typical choice of the design variables already discussed in the literature (see Gross
et al (2009), Gross (2009)) can be described as follows. The linear space of matrices
$M_m(\mathbb{C})$ can be equipped with the Hilbert-Schmidt inner product: $\langle A, B \rangle := \mathrm{tr}(AB^*)$. Let
$E_i$, $i = 1, \dots, m^2$, be an orthonormal basis of $M_m(\mathbb{C})$ consisting of Hermitian matrices
$E_i$. Let $X_j$, $j = 1, \dots, n$, be i.i.d. random variables sampled from a distribution $\Pi$ on the
set $\{E_1, \dots, E_{m^2}\}$. We will refer to this model as sampling from an orthonormal basis.
Most often, the uniform distribution $\Pi$ that assigns probability $m^{-2}$ to each basis matrix
$E_i$ will be used. Note that in this case $\mathbb{E}|\langle A, X \rangle|^2 = m^{-2} \|A\|_2^2$, where $\|\cdot\|_2 := \langle \cdot, \cdot \rangle^{1/2}$ is
the Hilbert-Schmidt (or the Frobenius) norm.

The following simple example is related to the problems of matrix completion extensively
discussed in the recent literature (see, e.g., Candes and Recht (2009), Candes and
Tao (2009), Recht (2009) and references therein). More precisely, it deals with a version
of matrix completion for Hermitian matrices (see Gross (2009)). In this case, when one
knows an entry $\rho_{ij}$ of a matrix $\rho$, one also knows the entry $\rho_{ji} = \overline{\rho_{ij}}$.

Example 1. Matrix completion. Let $\{e_i : i = 1, \dots, m\}$ be the canonical basis
of $\mathbb{C}^m$. Then, the following set of Hermitian matrices forms an orthonormal basis of
$M_m(\mathbb{C})$:

$$\bigl\{ e_i \otimes e_i : i = 1, \dots, m \bigr\} \cup \Bigl\{ \tfrac{1}{\sqrt{2}} (e_i \otimes e_j + e_j \otimes e_i) : 1 \le i < j \le m \Bigr\} \cup \Bigl\{ \tfrac{\mathrm{i}}{\sqrt{2}} (e_i \otimes e_j - e_j \otimes e_i) : 1 \le i < j \le m \Bigr\},$$

which will be called the matrix completion basis. Here and in what follows, $\otimes$ denotes the
tensor product of vectors or matrices. Note that, for a Hermitian matrix p, observing 
inner products (p, Ei) with randomly picked matrices Ei from the above basis provides 
information about real and imaginary parts of the entries of the matrix, which explains 
the connection to the matrix completion problems. Another option is to consider the 
following basis of the space of all Hermitian matrices: 

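$$\bigl\{ e_i \otimes e_i : i = 1, \dots, m \bigr\} \cup \Bigl\{ \tfrac{1}{2} (e_i \otimes e_j + e_j \otimes e_i) + \tfrac{\mathrm{i}}{2} (e_i \otimes e_j - e_j \otimes e_i) : 1 \le i < j \le m \Bigr\}.$$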

Inner products of a Hermitian matrix $\rho$ with the matrices of this basis are precisely the
entries $\rho_{ij}$, $i \le j$, of the matrix $\rho$. If now $\Pi$ is the (non-uniform) probability distribution that
assigns probabilities $m^{-2}$ to the matrices $e_i \otimes e_i$ corresponding to the diagonal entries and
probabilities $2m^{-2}$ to the other matrices of the basis, then $\mathbb{E}|\langle A, X \rangle|^2 = m^{-2} \|A\|_2^2$. Sampling
from this distribution is equivalent to sampling the entries of the matrix $\rho$ at random
(again, recall that when one learns an entry $\rho_{ij}$ one also learns $\rho_{ji} = \overline{\rho_{ij}}$).
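
The orthonormality claim of Example 1 can be checked numerically. The following is a minimal sketch, assuming that $e_i \otimes e_j$ is represented as the outer product of canonical basis vectors (the matrix with a single 1 in position $(i, j)$); the function names `matrix_completion_basis` and `hs_inner` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def matrix_completion_basis(m):
    """Hermitian matrices of Example 1 (assumed convention: e_i (x) e_j = outer(e_i, e_j))."""
    eye = np.eye(m, dtype=complex)
    basis = [np.outer(eye[i], eye[i]) for i in range(m)]       # diagonal units
    for i in range(m):
        for j in range(i + 1, m):
            eij = np.outer(eye[i], eye[j])
            basis.append((eij + eij.T) / np.sqrt(2))           # carries the real part of entry (i, j)
            basis.append(1j * (eij - eij.T) / np.sqrt(2))      # carries the imaginary part of entry (i, j)
    return basis

def hs_inner(a, b):
    """Hilbert-Schmidt inner product <A, B> = tr(A B*)."""
    return np.trace(a @ b.conj().T)

m = 3
basis = matrix_completion_basis(m)
# Gram matrix of the basis; it should be the identity of size m^2.
gram = np.array([[hs_inner(a, b) for b in basis] for a in basis])
```

For a Hermitian $\rho$, the inner products with the second and third families return $\sqrt{2}\,\mathrm{Re}\,\rho_{ij}$ and $\sqrt{2}\,\mathrm{Im}\,\rho_{ij}$ under this convention, which is the link to matrix completion mentioned in the text.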

Another example was studied by Gross et al (2009) and by Gross (2009). It is more
directly related to the problems of quantum state tomography.

Example 2. Pauli basis. Let $m = 2^k$. Consider the Pauli basis in the space of
$2 \times 2$ matrices $M_2(\mathbb{C})$: $W_i := \frac{1}{\sqrt{2}} \sigma_i$, where

$$\sigma_1 := \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad \sigma_2 := \begin{pmatrix} 0 & -\mathrm{i} \\ \mathrm{i} & 0 \end{pmatrix}, \quad \sigma_3 := \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} \quad \text{and} \quad \sigma_4 := \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$

are the Pauli matrices. Note that the Pauli matrices are both Hermitian and unitary. The
Pauli basis in $M_2(\mathbb{C})$ can be extended to a basis in the space of $m \times m$ matrices $M_m(\mathbb{C})$.
These matrices define linear transformations acting in the linear space $\mathbb{C}^m = \mathbb{C}^{2^k}$ that can
be viewed as a $k$-fold tensor product of spaces $\mathbb{C}^2$: $\mathbb{C}^{2^k} = (\mathbb{C}^2)^{\otimes k}$. Then, the Pauli basis in
the space of matrices $M_{2^k}(\mathbb{C})$ consists of all tensor products $W_{i_1} \otimes \dots \otimes W_{i_k}$, $(i_1, \dots, i_k) \in
\{1, 2, 3, 4\}^k$. As before, $X_1, \dots, X_n$ are i.i.d. random variables sampled from this set of
tensor products. Essentially, this is a standard measurement model for a $k$ qubit system
frequently used in quantum information, in particular, in quantum state and quantum
process tomography (see Nielsen and Chuang (2000), section 8.4.2).
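
As an illustration of Example 2, the sketch below builds the $k$-fold tensor products with `numpy.kron` and checks, for uniform sampling, the identity $\mathbb{E}|\langle A, X \rangle|^2 = m^{-2} \|A\|_2^2$ stated earlier for orthonormal bases. The normalization $W_i = \sigma_i / \sqrt{2}$ is assumed, and the helper name `pauli_basis` and the test matrix `a` are hypothetical.

```python
import itertools
import numpy as np

SIGMA = [
    np.array([[0, 1], [1, 0]], dtype=complex),
    np.array([[0, -1j], [1j, 0]], dtype=complex),
    np.array([[1, 0], [0, -1]], dtype=complex),
    np.eye(2, dtype=complex),
]

def pauli_basis(k):
    """All tensor products W_{i_1} (x) ... (x) W_{i_k}, with W_i = sigma_i / sqrt(2)."""
    w = [s / np.sqrt(2) for s in SIGMA]
    basis = []
    for idx in itertools.product(range(4), repeat=k):
        mat = np.array([[1.0 + 0j]])
        for i in idx:
            mat = np.kron(mat, w[i])
        basis.append(mat)
    return basis

k = 2
m = 2 ** k
basis = pauli_basis(k)
gram = np.array([[np.trace(a @ b.conj().T) for b in basis] for a in basis])

# Arbitrary Hermitian matrix used only to check the second-moment identity.
rng = np.random.default_rng(0)
g = rng.normal(size=(m, m)) + 1j * rng.normal(size=(m, m))
a = (g + g.conj().T) / 2

# Exact expectation under the uniform distribution on the m^2 basis matrices.
second_moment = np.mean([abs(np.trace(a @ b.conj().T)) ** 2 for b in basis])
frob_sq = np.linalg.norm(a, "fro") ** 2
```

The expectation is computed exactly by averaging over all $m^2$ basis elements, so no Monte Carlo error is involved.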

Example 3. Subgaussian design. Another interesting class of examples includes
subgaussian design matrices $X$ such that $\langle A, X \rangle$ is a subgaussian random variable for
each $A \in M_m(\mathbb{C})$. (Recall that a random variable $\eta$ is called subgaussian with parameter
$\sigma$ iff, for all $\lambda \in \mathbb{R}$, $\mathbb{E} e^{\lambda \eta} \le e^{\lambda^2 \sigma^2 / 2}$.) These examples are, probably, of less interest
in applications to quantum state tomography, but this is an important model, closely
related to randomized designs in compressed sensing, for which one can use powerful
tools developed in high-dimensional probability. For instance, one can consider the
Gaussian design, where $X$ is a symmetric random matrix with real entries such that
$\{X_{ij} : 1 \le i \le j \le m\}$ are independent centered normal random variables with $\mathbb{E} X_{ii}^2 =
1$, $i = 1, \dots, m$, and $\mathbb{E} X_{ij}^2 = \frac{1}{2}$, $i < j$. Alternatively, one can consider the Rademacher
design assuming that $X_{ii} = \varepsilon_{ii}$, $i = 1, \dots, m$, and $X_{ij} = \frac{1}{\sqrt{2}} \varepsilon_{ij}$, $i < j$, where $\{\varepsilon_{ij} : 1 \le i \le
j \le m\}$ are i.i.d. Rademacher random variables (that is, random variables taking values
$+1$ or $-1$ with probability $1/2$ each). In both cases, $\mathbb{E}|\langle A, X \rangle|^2 = \|A\|_2^2$, $A \in M_m(\mathbb{C})$
(such random matrices $X$ will be called isotropic) and $\langle A, X \rangle$ is a subgaussian random
variable whose subgaussian parameter is equal to $\|A\|_2$ (up to a constant).
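
The isotropy property of the Rademacher design can be verified exactly for small $m$ by enumerating all sign patterns. The sketch below does this for a real symmetric test matrix; it assumes the off-diagonal entries are scaled by $1/\sqrt{2}$ (so that their variance matches the value $1/2$ of the Gaussian design), and the function name `rademacher_design_matrices` and the matrix `a` are hypothetical.

```python
import itertools
import numpy as np

def rademacher_design_matrices(m):
    """Yield every possible value of the symmetric Rademacher design X (2^{m(m+1)/2} matrices)."""
    pairs = [(i, j) for i in range(m) for j in range(i, m)]
    for signs in itertools.product([-1.0, 1.0], repeat=len(pairs)):
        x = np.zeros((m, m))
        for (i, j), s in zip(pairs, signs):
            if i == j:
                x[i, i] = s                          # diagonal entries are +-1
            else:
                x[i, j] = x[j, i] = s / np.sqrt(2)   # assumed scaling of off-diagonal entries
        yield x

m = 3
# Real symmetric test matrix, chosen arbitrarily.
a = np.array([[1.0, 2.0, -0.5],
              [2.0, 0.0, 3.0],
              [-0.5, 3.0, -1.0]])

# All sign patterns are equally likely, so the plain average is the exact expectation.
second_moment = np.mean([np.trace(a @ x) ** 2 for x in rademacher_design_matrices(m)])
frob_sq = np.sum(a ** 2)
```

Since all sign patterns are enumerated, `second_moment` is the exact value of $\mathbb{E}\langle A, X \rangle^2$ for this matrix rather than a simulation estimate.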

The problems of this nature belong to a rapidly growing area of low rank matrix 
recovery. The most popular methods developed so far are based on nuclear norm regu- 
larization. 

In what follows, the Euclidean norm in the space $\mathbb{C}^m$ will be denoted by $|\cdot|$ and
the inner product will be denoted by $\langle \cdot, \cdot \rangle$ (with a little abuse of notation since it has
already been used for the Hilbert-Schmidt inner product between matrices). We will
denote by $\|\cdot\|_p$, $p \ge 1$, the Schatten $p$-norm of matrices in $M_m(\mathbb{C})$ (and, if needed, in
other matrix spaces). Specifically,

$$\|A\|_p := \Bigl( \sum_{j=1}^{m} \lambda_j^p(|A|) \Bigr)^{1/p}, \quad \text{where } |A| := (A^* A)^{1/2}$$

and, for a Hermitian matrix $B$, $\lambda_k(B)$, $k = 1, \dots, m$, are the eigenvalues of $B$ (usually
arranged in the decreasing order). In particular, $\|\cdot\|_1$ is the usual nuclear norm and $\|\cdot\|_2$
is the Hilbert-Schmidt norm. We will use the notation $\|\cdot\|$ for the operator norm. In
addition to the metrics generated by these norms, some other distances will be of interest
in connection to the statistical problems discussed in this paper. In particular, denoting
by $\Pi$ the distribution of the design matrix $X$, we will write

$$\|A\|_{L_2(\Pi)}^2 := \int \langle A, x \rangle^2 \, \Pi(dx) = \mathbb{E} \langle A, X \rangle^2, \quad A \in M_m(\mathbb{C}),$$

and we will often use the corresponding $L_2(\Pi)$-distance between matrices (say, between
two states $S_1, S_2 \in \mathcal{S}$). This distance represents the prediction error in the statistical
problems in question.
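
The Schatten norms introduced above can be computed from singular values, since the eigenvalues of $|A| = (A^* A)^{1/2}$ are exactly the singular values of $A$. A minimal sketch follows; the function name `schatten_norm` is hypothetical.

```python
import numpy as np

def schatten_norm(a, p):
    """Schatten p-norm of a matrix; p = np.inf gives the operator norm."""
    s = np.linalg.svd(a, compute_uv=False)   # singular values = eigenvalues of |A|
    if np.isinf(p):
        return float(s.max())
    return float(np.sum(s ** p) ** (1.0 / p))

a = np.array([[3.0, 0.0],
              [0.0, -4.0]])
nuclear = schatten_norm(a, 1)          # nuclear norm
hs = schatten_norm(a, 2)               # Hilbert-Schmidt (Frobenius) norm
operator = schatten_norm(a, np.inf)    # operator norm
```

For this diagonal example the singular values are 4 and 3, so the three norms are 7, 5 and 4 respectively.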

In the noiseless case (i.e., when $\xi_j = 0$), the following estimator of $\rho$ has been
extensively studied, especially in the case of matrix completion problems (see Candes
and Recht (2009), Candes and Tao (2009), Gross (2009), Recht (2009) and references
therein):

$$\hat\rho := \operatorname{argmin} \bigl\{ \|S\|_1 : S \in M_m(\mathbb{C}),\ \langle S, X_j \rangle = Y_j,\ j = 1, \dots, n \bigr\}.$$

Under some assumptions that resemble the restricted isometry conditions used in compressed
sensing, it was shown that, with a high probability, $\hat\rho = \rho$ provided that the
number $n$ of observations is sufficiently large. Namely, up to logarithmic factors and
constants, it should be of the order $mr$, where $r$ is the rank of the target matrix $\rho$.

In the noisy case, one has to deal with a matrix regression problem and the following
penalized least squares estimator, which is akin to the LASSO used in sparse regression,
was proposed and studied (see, e.g., Candes and Plan (2009), Rohde and Tsybakov
(2009)):

$$\hat\rho^{\varepsilon} := \operatorname{argmin}_{S \in M_m(\mathbb{C})} \Bigl[ n^{-1} \sum_{j=1}^{n} \bigl(Y_j - \mathrm{tr}(S X_j)\bigr)^2 + \varepsilon \|S\|_1 \Bigr], \qquad (1.1)$$

where $\varepsilon$ is a regularization parameter. Note that these estimators are not constrained
to the set $\mathcal{S}$ of density matrices (since for these matrices the nuclear norm is equal to
1). Candes and Plan (2009) have also studied another estimator based on the nuclear
norm minimization subject to linear constraints that resembles the Dantzig selector used
in compressed sensing, and Rohde and Tsybakov (2009) suggested estimators based on
nonconvex penalties involving Schatten "$p$-norms" for $p < 1$.

We will study the following estimator of the unknown state $\rho$ defined as a solution
of a penalized empirical risk minimization problem:

$$\hat\rho^{\varepsilon} := \operatorname{argmin}_{S \in \mathcal{S}} \Bigl[ n^{-1} \sum_{j=1}^{n} \bigl(Y_j - \mathrm{tr}(S X_j)\bigr)^2 + \varepsilon\, \mathrm{tr}(S \log S) \Bigr], \qquad (1.2)$$

where $\varepsilon > 0$ is a regularization parameter. The penalty term is based on the functional
$\mathrm{tr}(S \log S) = -\mathcal{E}(S)$, where $\mathcal{E}(S)$ is the von Neumann entropy of state $S$. Thus, the
method considered in this paper is based on a trade-off between fitting the model by the
least squares in the class of all density matrices and maximizing the entropy of the state.
One can also consider a slightly different estimator defined as follows:



Of course, the estimator (1.3) requires the knowledge of the design distribution $\Pi$ while
the estimator (1.2) can also be used in the cases when $\Pi$ is unknown. It happens that
it is somewhat easier to study the properties of estimator (1.3) than of (1.2), for which
one has to deal with more complicated empirical processes. Note that both optimization
problems (1.2) and (1.3) are convex (this is based on convexity of the penalty term that
follows from the concavity of von Neumann entropy, see Nielsen and Chuang (2000)). In
what follows, we will study only the estimators defined by (1.2).
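
To make the objective of the entropy-penalized problem concrete, the following sketch evaluates the penalized empirical risk, that is, the mean squared residual plus $\varepsilon\, \mathrm{tr}(S \log S)$, for a given candidate state. It is only an evaluation of the criterion, not a solver. The convention $0 \log 0 := 0$ and the names `neg_entropy`, `penalized_risk`, `xs`, `ys` are assumptions made for this illustration.

```python
import numpy as np

def neg_entropy(s, tol=1e-12):
    """tr(S log S) for a density matrix S, using the assumed convention 0 log 0 := 0."""
    lam = np.linalg.eigvalsh(s)
    lam = lam[lam > tol]                 # drop (numerically) zero eigenvalues
    return float(np.sum(lam * np.log(lam)))

def penalized_risk(s, xs, ys, eps):
    """Empirical squared error of the fit tr(S X_j) plus eps * tr(S log S)."""
    fitted = np.array([np.trace(s @ x).real for x in xs])
    return float(np.mean((ys - fitted) ** 2) + eps * neg_entropy(s))

m = 4
mixed = np.eye(m) / m                    # maximally mixed state, entropy log m
pure = np.zeros((m, m))
pure[0, 0] = 1.0                         # pure state, entropy 0

# Two hypothetical observables and noiseless responses generated by `pure`.
xs = [np.diag([1.0, -1.0, 0.0, 0.0]), np.diag([0.0, 0.0, 1.0, -1.0])]
ys = np.array([np.trace(pure @ x).real for x in xs])

risk_pure = penalized_risk(pure, xs, ys, eps=0.1)
risk_mixed = penalized_risk(mixed, xs, ys, eps=0.1)
```

The example shows the trade-off described in the text: the maximally mixed state has the smallest possible penalty $-\log m$ but fits the data worse than the state that generated them.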

A commutative version of entropy penalization and its connections to sparse re- 
covery problems in convex hulls of finite dictionaries have been studied by Koltchinskii 
(2009). In the current paper, this approach is extended to the noncommutative case. 

2 An Overview of Main Results 

The results of this paper include oracle inequalities for the $L_2(\Pi)$-error of the empirical
solution $\hat\rho^{\varepsilon}$. They will be stated in a general form in sections 5 and 6. Here we formulate
our results only in two of the special examples outlined in the Introduction: subgaussian
isotropic design (such as Gaussian or Rademacher) and random sampling from the Pauli
basis. Assume, for simplicity, that the noise $\{\xi_j\}$ is a sequence of i.i.d. $N(0, \sigma_\xi^2)$ random
variables (i.e., it is a Gaussian noise).

Let $t > 0$ be fixed and denote $t_m := t + \log(2m)$, $\tau_n := t + \log\log_2(2n)$.

First we consider the case of subgaussian isotropic design. Note that in this case
$\|A\|_{L_2(\Pi)} = \|A\|_2$, $A \in M_m(\mathbb{C})$. Given a subspace $L \subset \mathbb{C}^m$, $P_L$ denotes the orthogonal
projection on $L$ and $L^{\perp}$ denotes its orthogonal complement.

Theorem 1 Suppose $X$ is a subgaussian isotropic matrix. There exist constants $C >
0$, $c > 0$ such that the following holds. Under the assumption that $\tau_n \le cn$, for all $\varepsilon \in
[0, 1]$, with probability at least $1 - e^{-t}$,

HlLmSC^IUogPllAiog^V^V^V 

frh{r n log n V t„ 



\P £ 



(* 6 V 



(2.1) 



Moreover, there exists a constant D > suc/i i/iai, /or a// e > Da^ (j^^- V ^ m ^j , 
with probability at least $1 - e^{-t}$,

2||5 - p||i 2(n) + c(*\\ log Sg V 4 mdim l L) + T " V 

(2.2) 



This theorem includes two bounds on the $L_2(\Pi)$-error of $\hat\rho^{\varepsilon}$. The first bound (2.1)
holds for all $\varepsilon$ including $\varepsilon = 0$, which is the case of the unpenalized least squares estimator.
The term of this bound that involves $\varepsilon$ depends on the operator norm of
$\log \rho$ and it has to do with the approximation error of the entropy penalization method
(see Section 4). The second bound (2.2) is an oracle inequality that controls the squared
$L_2(\Pi)$-error of the estimator $\hat\rho^{\varepsilon}$ in terms of approximation errors of oracles $S \in \mathcal{S}$. The
term $\varepsilon^2 \|\log S\|_2^2$ in this bound is also related to the approximation error of the entropy
penalization method discussed in Section 4. This term depends on the Hilbert-Schmidt
norm of $\log S$. The dependence on $\varepsilon$ is better than in the first bound, but bound (2.2)
holds only for the values of the regularization parameter above a certain threshold. Clearly, in
the second bound, the oracles $S$ are to be of full rank (otherwise, $\log S$ does not exist and
the right hand side of the bound becomes infinite). The random errors in these bounds
are also different. In the first bound, it is of the order $n^{-1/2}$ (up to logarithmic factors).
In the second bound, the error term depends on how well the oracle $S$ is approximated
by low rank matrices. If there exists a subspace $L$ of small dimension $\dim(L)$ such that
$\|P_{L^{\perp}} S P_{L^{\perp}}\|_1$ is small (say, of the order $n^{-1/2}$), then the random part of the error in (2.2)
is essentially controlled by $\sigma_\xi^2 \dim(L)\, m / n$.

It will be shown later in the paper how to derive from the bounds of Theorem 1 and
more general bounds for oracles of full rank some other inequalities for low rank oracles.
In particular, for subgaussian isotropic design and Gaussian noise, this approach yields
the following result. To simplify its formulation, we will assume that, for some constant
$c > 0$, $\tau_n \le cn$ and $t_m \le n$.
Theorem 2 Suppose $X$ is a subgaussian isotropic matrix. There exist a constant $c > 0$
and, for all sufficiently large D > 0, a constant C > such that, for e := Da^ 1 "" 
with probability at least $1 - e^{-t}$,

'<j|rank(5)mt m log 2 (mn) . . m ( Tn log n V t 



\P £ -p\\ L2{ u)<lf s 



ns- P \? Lm +c 



n 



V 



n 



(2.3) 



A simple consequence of the first bound of Theorem 1 and the bound of Theorem
2 is the following inequality that holds with probability at least $1 - e^{-t}$ and with some
C > for e := Da^ ^ 



,e „,|2 <c 



\P -Plli 2 (n) 



log (mn) /\ 



a|rank(p)mt m log 2 (ran) \ m ( Tn l og nVt 



r? 



V 



Next we consider the case of sampling from the Pauli basis. In this case, $\|A\|_{L_2(\Pi)} =
m^{-1} \|A\|_2$, $A \in M_m(\mathbb{C})$. As before, we fix $t > 0$ and assume that $t_m \le n$.



Theorem 3 Suppose that $X$ is sampled at random from the uniform distribution $\Pi$ on
the Pauli basis. Then, there exists a constant $C > 0$ such that, for all $\varepsilon \in [0, 1]$, with
probability at least $1 - e^{-t}$,



\P -PllzaCn) 



2 < C 



logp||Alog(^))V(^Vm-V^ 



(2.4) 



In addition, for all sufficiently large $D > 0$, there exists a constant $C > 0$ such that, for

V n 



with probability at least $1 - e^{-t}$,



|p £ -p|lL(n)< mf 



2||S-p|li 2 (n) + CKVm 



2 _ 1 rank(S')mt m log 2 (mn) 



(2.5) 



Similarly to the previous theorems, one can easily derive from Theorem 3 the following
bound

.J. rank(p)mi m log 2 (ran) 



\P £ -P\\UU)<C 



foV 



-l/2\ 



— ^- log(mn) A (aj V m 



t n „A ^ n t„„ , nl * . ™ — 1/2 



that holds with probability at least 1 — e and with some C > for e = £>(ct«™ 7 V 



m -l) A /*za. 
It is worth mentioning that the results of sections 4, 5 provide a way to bound
the error of estimator $\hat\rho^{\varepsilon}$ not only in the $L_2(\Pi)$-distance, but also in other statistically
important distances such as noncommutative Kullback-Leibler, Hellinger and nuclear
norm distance (see Section 3.1 for their definitions). For instance, under the assumptions
of Theorem 1, the following bound for the symmetrized Kullback-Leibler distance holds
with probability at least $1 - e^{-t}$:



K(p £ ;p)<- inf 

e LcC" 



£ 2 iuogpiilV^ 2 mdim( n L)+rn V 

(2.6) 



n d d II mt m\i, ,. /—^ sfmijn log n V t 



ti n 



In the case of sampling from Pauli basis (as in Theorem [3]), it is easy to derive from 
Theorem [5] of Section 5 (using also some bounds from the proofs of Proposition [5] and 
Corollary [I]) the following bound on the squared Hellinger distance between p £ and p : 



H 2 (p £ ;p) < C{a/:Vm 



_ 1/2 rank(/o)Vmt m log (mn) 



n 



that holds with probability at least 1 — e ' for e = D(a^m l l 2 V m 

It has already been mentioned that the first bounds of Theorems 1 and 3 (bounds
(2.1) and (2.4)) hold for all $\varepsilon \ge 0$, even in the case of the unpenalized least squares estimator
with $\varepsilon = 0$. The random error parts of these bounds are (up to logarithmic factors)
of the order $n^{-1/2}$ as $n \to \infty$. Bounds (2.2), (2.3) and (2.5) are based on a more subtle
analysis taking into account the ranks of the oracles $S$ approximating the true density
matrix $\rho$. In these bounds, the size of the $L_2(\Pi)$-error $\|\hat\rho^{\varepsilon} - \rho\|_{L_2(\Pi)}^2$ is determined by a
trade-off between the approximation error $\|S - \rho\|_{L_2(\Pi)}^2$ of an oracle $S$ and the random
error. In the case of bounds (2.3) and (2.5), the last error is of the order $\frac{\sigma_\xi^2\, \mathrm{rank}(S)\, m}{n}$ (up
to logarithmic factors), and it depends on the rank of the oracle $S$. In particular, taking
$S = \rho$, we can conclude that $\|\hat\rho^{\varepsilon} - \rho\|_{L_2(\Pi)}^2$ is bounded by $\frac{\sigma_\xi^2\, \mathrm{rank}(\rho)\, m}{n}$ (up to constants and
logarithmic factors). This means that von Neumann entropy penalization mimics oracles
that know precisely which low rank matrices approximate $\rho$ well and can estimate $\rho$
by estimating a "small" number of parameters needed to describe such oracles. This
could be compared with recent results for nuclear norm penalization (Candes and Plan
(2009), Rohde and Tsybakov (2009)). Depending on the values of $\sigma_\xi$, $m$, $n$ and other
characteristics of the problem, the more "rough" bounds (2.1) and (2.4) might become even
sharper than the more "subtle" bounds (2.2), (2.3) and (2.5) (see Rohde and Tsybakov
(2009) for a discussion of a similar phenomenon). Since the random error term in the more
"subtle" bounds is proportional to $\sigma_\xi^2$ and in the "rough" bounds it is proportional to
$\sigma_\xi$, the "rough" bounds become sharper for the values of the standard deviation of the noise
$\sigma_\xi$ above a threshold that depends on $n$ and $m$. Thus, the rate of convergence of the
$L_2(\Pi)$-error to zero in a particular asymptotic scenario (when certain characteristics are
large) is determined by the bounds of both types.

Theorems 1-3 and other results of a similar nature will follow as corollaries from
more general oracle inequalities that we establish under broader assumptions on the design
distributions and on the noise. To prove these results, we need several tools from the
theory of empirical processes and random matrices, such as noncommutative Bernstein type
inequalities and generic chaining bounds for empirical processes. We will discuss these
results in Section 3 (as well as some properties of noncommutative Kullback-Leibler,
Hellinger and other distances between density matrices). We will then study approximation
error bounds for the solution of the von Neumann entropy penalized true risk minimization
problem (Section 4) and, finally, in sections 5 and 6, derive the main results of
the paper concerning random error bounds for the empirical solution $\hat\rho^{\varepsilon}$. More precisely,
we bound the squared $L_2(\Pi)$-distance $\|\hat\rho^{\varepsilon} - S\|_{L_2(\Pi)}^2$ and the symmetrized Kullback-Leibler
distance $K(\hat\rho^{\varepsilon}; S)$ from $\hat\rho^{\varepsilon}$ to an arbitrary "oracle" $S \in \mathcal{S}$ and derive oracle inequalities
for the squared $L_2(\Pi)$-error $\|\hat\rho^{\varepsilon} - \rho\|_{L_2(\Pi)}^2$ of the empirical solution $\hat\rho^{\varepsilon}$. These results are
first established for oracles $S$ of full rank and expressed in terms of certain characteristics
of the operator $\log S$ (which is, essentially, a subgradient of the von Neumann entropy
penalty used in (1.2)). Using simple techniques discussed in Section 4, we then develop
the bounds for low rank oracles $S$ (such as the bounds of Theorems 2 and 3) and also
obtain oracle inequalities for so called "Gibbs oracles". Note that the logarithmic factors
involved in the bounds of Theorems 2 and 3 (and in other results of this type discussed
later in the paper), in particular, the factor $\log^2(mn)$, are related to the need to bound
certain norms of $\log S$ for special oracles $S \in \mathcal{S}$ (as in Theorem 1). In the case of $\|S\|_1$-penalization,
$\log S$ should be replaced with a version of $\mathrm{sign}(S)$ and one can avoid some
of the logarithmic factors in this case.
3 Preliminaries: Distances in S, Empirical Processes and
Exponential Inequalities for Random Matrices 

3.1 Noncommutative Kullback-Leibler and other distances 

We will use noncommutative extensions of classical distances between probability distributions
such as Kullback-Leibler and Hellinger distances. These extensions are common
in quantum information theory (see Nielsen and Chuang (2000)). In particular, we will
use Kullback-Leibler divergence between two states $S_1, S_2 \in \mathcal{S}$ defined as

$$K(S_1 \| S_2) := \mathbb{E}_{S_1} (\log S_1 - \log S_2) = \mathrm{tr}\bigl(S_1 (\log S_1 - \log S_2)\bigr)$$

and its symmetrized version

$$K(S_1; S_2) := K(S_1 \| S_2) + K(S_2 \| S_1) = \mathrm{tr}\bigl((S_1 - S_2)(\log S_1 - \log S_2)\bigr).$$

We will also use a noncommutative version of Hellinger distance defined as follows. For
any two states $S_1, S_2 \in \mathcal{S}$, let $F(S_1, S_2) := \mathrm{tr} \sqrt{S_1^{1/2} S_2 S_1^{1/2}}$. This quantity is called the
fidelity of states $S_1, S_2$ (see, e.g., Nielsen and Chuang (2000), p. 409). Then, a natural definition
of the squared Hellinger distance is $H^2(S_1, S_2) := 2(1 - F(S_1, S_2))$. A remarkable
property of this distance is that

$$H^2(S_1, S_2) = \sup H^2(\{p_i\}; \{q_i\}) = \sup \sum_i \bigl(\sqrt{p_i} - \sqrt{q_i}\bigr)^2,$$

where the supremum is taken over all POVMs $\{E_i\}$ (positive operator valued measures)
and $p_i := \mathrm{tr}(S_1 E_i)$, $q_i := \mathrm{tr}(S_2 E_i)$. [In the discrete case, a positive operator valued
measure is a set $\{E_i\}$ of Hermitian nonnegatively definite matrices such that $\sum_i E_i = I$.]

Thus, the quantum Hellinger distance is just the largest "classical" Hellinger distance
between the probability distributions $\{p_i\}, \{q_i\}$ of a "measurement" $\{E_i\}$ in the states
$S_1, S_2$ (see Nielsen and Chuang (2000), p. 412). The same property also holds for two
other important "distances", the trace distance $\|S_1 - S_2\|_1$ and the Kullback-Leibler
divergence $K(S_1 \| S_2)$ (see, e.g., Klauck et al (2007)). These properties immediately imply
an extension of classical inequalities for these distances:

$$\tfrac{1}{4} \|S_1 - S_2\|_1^2 \le H^2(S_1, S_2) \le K(S_1 \| S_2).$$

They also imply the following simple proposition used below. It shows that, if two matrices
$S_1, S_2$ are close in the Hellinger distance and one of them (say, $S_2$) is "approximately
low rank" in the sense that there exists a subspace $L \subset \mathbb{C}^m$ of small dimension such that
$\|P_{L^{\perp}} S_2 P_{L^{\perp}}\|_1$ is small, then the other matrix $S_1$ is also "approximately low rank" with
the same "support" $L$.

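Proposition 1 For all subspaces $L \subset \mathbb{C}^m$ and all $S_1, S_2 \in \mathcal{S}$,

$$\|P_{L^{\perp}} S_1 P_{L^{\perp}}\|_1 \le 2 \|P_{L^{\perp}} S_2 P_{L^{\perp}}\|_1 + 2 H^2(S_1, S_2).$$

Proof. Indeed, take an orthonormal basis $\{e_1, \dots, e_m\}$ in $\mathbb{C}^m$ such that $L = \mathrm{l.s.}(\{e_1, \dots, e_k\})$, where $k = \dim(L)$.
Let $p_j := \langle S_1 e_j, e_j \rangle = \mathrm{tr}(S_1 (e_j \otimes e_j))$ and $q_j := \langle S_2 e_j, e_j \rangle = \mathrm{tr}(S_2 (e_j \otimes e_j))$. Since the matrices $e_j \otimes e_j$, $j = 1, \dots, m$, form a POVM,

$$H^2(S_1, S_2) \ge \sum_{j=1}^{m} \bigl(\sqrt{p_j} - \sqrt{q_j}\bigr)^2 \ge \sum_{j=k+1}^{m} \bigl(\sqrt{p_j} - \sqrt{q_j}\bigr)^2 = \sum_{j=k+1}^{m} p_j + \sum_{j=k+1}^{m} q_j - 2 \sum_{j=k+1}^{m} \sqrt{p_j} \sqrt{q_j},$$

which implies (using that $2\sqrt{ab} \le a/2 + 2b$)

$$\|P_{L^{\perp}} S_1 P_{L^{\perp}}\|_1 = \sum_{j=k+1}^{m} p_j \le 2 \sum_{j=k+1}^{m} \sqrt{p_j} \sqrt{q_j} - \sum_{j=k+1}^{m} q_j + H^2(S_1, S_2)$$

$$\le \frac{1}{2} \sum_{j=k+1}^{m} p_j + \sum_{j=k+1}^{m} q_j + H^2(S_1, S_2) = \frac{1}{2} \|P_{L^{\perp}} S_1 P_{L^{\perp}}\|_1 + \|P_{L^{\perp}} S_2 P_{L^{\perp}}\|_1 + H^2(S_1, S_2),$$

and the result follows.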
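
The distances of this subsection are easy to evaluate numerically for small full-rank states. The sketch below computes the trace distance, the squared Hellinger distance and the Kullback-Leibler divergence through eigendecompositions, and evaluates both sides of the bound of Proposition 1 for a one-dimensional subspace $L$. The factor $1/4$ in front of the squared trace distance is the classical normalization and is an assumption here; the two example states and all function names are hypothetical.

```python
import numpy as np

def herm_fn(s, fn):
    """Apply a scalar function to a Hermitian matrix through its eigendecomposition."""
    lam, u = np.linalg.eigh(s)
    return (u * fn(lam)) @ u.conj().T

def nuclear_norm(a):
    """Schatten 1-norm of a Hermitian matrix (sum of absolute eigenvalues)."""
    return float(np.sum(np.abs(np.linalg.eigvalsh(a))))

def kl(s1, s2):
    """K(S1 || S2) = tr(S1 (log S1 - log S2)); both states must have full rank."""
    return float(np.trace(s1 @ (herm_fn(s1, np.log) - herm_fn(s2, np.log))).real)

def hellinger_sq(s1, s2):
    """H^2(S1, S2) = 2 (1 - F), with fidelity F = tr sqrt(S1^{1/2} S2 S1^{1/2})."""
    root = herm_fn(s1, np.sqrt)
    lam = np.linalg.eigvalsh(root @ s2 @ root)
    fidelity = np.sum(np.sqrt(np.clip(lam, 0.0, None)))
    return float(2.0 * (1.0 - fidelity))

# Two non-commuting full-rank states on C^2.
s1 = np.diag([0.7, 0.3])
s2 = np.array([[0.5, 0.2],
               [0.2, 0.5]])
p_perp = np.diag([0.0, 1.0])     # projection onto L-perp, with L spanned by e_1

trace_term = 0.25 * nuclear_norm(s1 - s2) ** 2
h2 = hellinger_sq(s1, s2)
k12 = kl(s1, s2)
k_sym = float(np.trace((s1 - s2) @ (herm_fn(s1, np.log) - herm_fn(s2, np.log))).real)

lhs = nuclear_norm(p_perp @ s1 @ p_perp)                     # left side of Proposition 1
rhs = 2.0 * nuclear_norm(p_perp @ s2 @ p_perp) + 2.0 * h2    # right side of Proposition 1
```

The logarithm is applied to the eigenvalues directly, so the sketch is valid only for states of full rank, in line with the remark that $\log S$ does not exist otherwise.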



3.2 Empirical processes bounds 

We will use several inequalities for empirical processes indexed by a class of measurable
functions $\mathcal{F}$ defined on an arbitrary measurable space $(S, \mathcal{A})$. Let $X, X_1, \dots, X_n$ be i.i.d.
random variables in $(S, \mathcal{A})$ with common distribution $P$. If $\mathcal{F}$ is uniformly bounded by a
number $U$, then Bousquet's version of Talagrand's concentration inequality
for empirical processes implies that, for all $t > 0$, with probability at least $1 - e^{-t}$,

$$\sup_{f \in \mathcal{F}} \Bigl| n^{-1} \sum_{j=1}^{n} f(X_j) - \mathbb{E} f(X) \Bigr| \le 2 \Bigl[ \mathbb{E} \sup_{f \in \mathcal{F}} \Bigl| n^{-1} \sum_{j=1}^{n} f(X_j) - \mathbb{E} f(X) \Bigr| + \sigma \sqrt{\frac{t}{n}} + U \frac{t}{n} \Bigr],$$

where $\sigma^2 := \sup_{f \in \mathcal{F}} \mathrm{Var}_P(f(X))$. We will also need a version of this bound for function
classes that are not necessarily uniformly bounded. Such a bound was recently proved
by Adamczak (2008). Let $F(x) \ge \sup_{f \in \mathcal{F}} |f(x)|$, $x \in S$, be an envelope of the class. It
follows from Theorem 4 of Adamczak (2008) that there exists a constant $K > 0$ such
that, for all $t > 0$, with probability at least $1 - e^{-t}$,



sup 



n 



E/(X) 




E sup 













+ 



max \F(Xj)\ 



In addition to this, we will need to bound the following expectation:

$$\mathbb{E} \sup_{f \in \mathcal{F}} \Bigl| n^{-1} \sum_{j=1}^{n} f^2(X_j) - \mathbb{E} f^2(X) \Bigr|.$$



A usual approach to this problem is to use a symmetrization inequality to replace the
empirical process by a Rademacher process, and then to use Talagrand's comparison
(contraction) inequality (see, e.g., Ledoux and Talagrand (1991), Section 4.5) to get rid
of the squares. This, however, would require the class $\mathcal{F}$ to be uniformly bounded by some
$U > 0$, which is not too large. This approach is not sufficient in the case of subgaussian
design considered in the last section. A more subtle approach has been developed in
recent years by Klartag and Mendelson (2005), Mendelson (2010) and it is based on
generic chaining bounds.

Talagrand's generic chaining complexity (see Talagrand (2005)) of a metric space
$(T, d)$ is defined as follows. An admissible sequence $\{\Delta_n\}_{n \ge 0}$ is an increasing sequence
of partitions of $T$ (i.e., each next partition is a refinement of the previous one) such that
$\mathrm{card}(\Delta_0) = 1$ and $\mathrm{card}(\Delta_n) \le 2^{2^n}$, $n \ge 1$. For $t \in T$, $\Delta_n(t)$ denotes the unique subset in
$\Delta_n$ that contains $t$. For a set $A \subset T$, $D(A)$ denotes its diameter. Then, define the generic
chaining complexity $\gamma_2(T; d)$ as

$$\gamma_2(T; d) := \inf_{\{\Delta_n\}_{n \ge 0}} \sup_{t \in T} \sum_{n \ge 0} 2^{n/2} D(\Delta_n(t)),$$

where the inf is taken over all admissible sequences of partitions.

If $\{X(t) : t \in T\}$ is a centered Gaussian process such that $\mathbb{E}(X(t) - X(s))^2 =
d^2(t, s)$, $t, s \in T$, then it was proved by Talagrand that

$$K^{-1} \gamma_2(T; d) \le \mathbb{E} \sup_{t \in T} X(t) \le K \gamma_2(T; d),$$

where $K > 0$ is a universal constant. Thus, the generic chaining complexity $\gamma_2(T; d)$ is a
natural characteristic of the size of the Gaussian process $X(t)$, $t \in T$.

Similar quantities can also be used to control the size of empirical processes indexed
by a function class $\mathcal{F}$. It is natural to define $\gamma_2(\mathcal{F}; L_2(P))$, that is, $\gamma_2(\mathcal{F}; d)$, where $d$ is
the $L_2(P)$-distance. Some other distances are also useful, for instance, the $\psi$-distance
associated with the probability space $(S, \mathcal{A}, P)$. Recall that, for a convex increasing
function $\psi$ with $\psi(0) = 0$,

$$\|f\|_{\psi} := \inf \Bigl\{ C > 0 : \int_S \psi\Bigl(\frac{|f|}{C}\Bigr) \, dP \le 1 \Bigr\}$$

(see van der Vaart and Wellner (1996), p. 95). If $\psi(u) = u^p$, $u \ge 0$, for some $p \ge 1$,
the corresponding $\psi$-norm is just the $L_p$-norm. Other important choices are the functions
$\psi_\alpha(t) = e^{t^\alpha} - 1$, $t \ge 0$, $\alpha \ge 1$, especially $\psi_2$, which is related to subgaussian tails of $f$, and
$\psi_1$, which is related to subexponential tails.

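The generic chaining complexity that corresponds to the $\psi_2$-distance will be denoted
by $\gamma_2(\mathcal{F}; \psi_2)$. Mendelson (2010) proved the following deep result (strengthening previous
results by Klartag and Mendelson (2005)). Suppose that $\mathcal{F}$ is a symmetric class, that is,
$f \in \mathcal{F}$ implies $-f \in \mathcal{F}$, and $Pf = \mathbb{E} f(X) = 0$, $f \in \mathcal{F}$. Then, for some universal constant
$K > 0$,

$$\mathbb{E} \sup_{f \in \mathcal{F}} \Bigl| n^{-1} \sum_{j=1}^{n} f^2(X_j) - \mathbb{E} f^2(X) \Bigr| \le K \Bigl[ \sup_{f \in \mathcal{F}} \|f\|_{\psi_1} \frac{\gamma_2(\mathcal{F}; \psi_2)}{\sqrt{n}} \vee \frac{\gamma_2^2(\mathcal{F}; \psi_2)}{n} \Bigr].$$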



3.3 Noncommutative Bernstein type inequalities 

We will need the following operator version of Bernstein's inequality, which is due to
Ahlswede and Winter (2002) (and which has already been used successfully in low
rank recovery problems by Gross et al (2009), Gross (2009), Recht (2009)).

In this subsection, assume that $X, X_1, \dots, X_n$ are i.i.d. random Hermitian $m \times m$
matrices with $\mathbb{E} X = 0$ and $\sigma_X^2 := \|\mathbb{E} X^2\|$.

Bernstein's inequality for operator valued r.v. Suppose that $\|X\| \le U$ for
some $U > 0$. Then

$$\mathbb{P}\bigl\{ \|X_1 + \dots + X_n\| \ge t \bigr\} \le 2m \exp\Bigl\{ -\frac{t^2}{2 \sigma_X^2 n + 2Ut/3} \Bigr\}. \qquad (3.1)$$

In fact, we will frequently use the following bound that immediately follows from
the version of Bernstein's inequality given above: for all $t > 0$, with probability at least
$1 - e^{-t}$,

$$\Bigl\| \frac{X_1 + \dots + X_n}{n} \Bigr\| \le 2 \Bigl( \sigma_X \sqrt{\frac{t + \log(2m)}{n}} \vee U \frac{t + \log(2m)}{n} \Bigr). \qquad (3.2)$$
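
For the reader's convenience, here is a short sketch of how a bound of the form (3.2) can be obtained from a Bernstein tail bound. It assumes that (3.1) has the standard form with exponent $-t^2 / (2\sigma_X^2 n + 2Ut/3)$; the auxiliary symbols $u$ and $s$ are introduced only for this illustration.

```latex
% Notation (introduced here): u := t + \log(2m), \quad s := 2\bigl(\sigma_X \sqrt{nu} \vee U u\bigr).
\begin{align*}
  & s \ge 2\sigma_X\sqrt{nu} \;\Longrightarrow\; 2\sigma_X^2 n u \le \tfrac{1}{2}s^2,
    \qquad
    s \ge 2Uu \;\Longrightarrow\; \tfrac{2}{3}\,U u s \le \tfrac{1}{3}s^2, \\
  & \text{hence}\quad u\Bigl(2\sigma_X^2 n + \tfrac{2}{3}Us\Bigr) \le \tfrac{5}{6}s^2 \le s^2,
    \quad\text{that is,}\quad \frac{s^2}{2\sigma_X^2 n + 2Us/3} \ge u, \\
  & \mathbb{P}\bigl\{\|X_1+\dots+X_n\| \ge s\bigr\}
    \le 2m\exp\Bigl\{-\frac{s^2}{2\sigma_X^2 n + 2Us/3}\Bigr\}
    \le 2m\,e^{-u} = e^{-t}.
\end{align*}
% Dividing s by n gives
% \|(X_1+\dots+X_n)/n\| \le 2\bigl(\sigma_X\sqrt{u/n} \vee U u/n\bigr)
% with probability at least 1 - e^{-t}.
```

The choice of $s$ as twice the maximum of the two terms is what produces the constant 2 in front of the bound; any larger constant would work as well.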
Moreover, it is possible to replace the $L_\infty$-bound $U$ on $\|X\|$ in the above inequality
by bounds on the weaker $\psi_\alpha$-norms. Denote $U_X^{(\alpha)} := \bigl\| \|X\| \bigr\|_{\psi_\alpha}$, $\alpha \ge 1$.

Proposition 2 Let $\alpha \ge 1$. There exists a constant $C > 0$ such that, for all $t > 0$, with
probability at least $1 - e^{-t}$,

$$\Bigl\| \frac{X_1 + \dots + X_n}{n} \Bigr\| \le C \Bigl( \sigma_X \sqrt{\frac{t + \log(2m)}{n}} \vee U_X^{(\alpha)} \Bigl( \log \frac{U_X^{(\alpha)}}{\sigma_X} \Bigr)^{1/\alpha} \frac{t + \log(2m)}{n} \Bigr). \qquad (3.3)$$



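Note that, in the limit $\alpha \to \infty$, inequality (3.3) coincides with (3.2) (up to a constant).

Proof. Similarly to the proof of (3.1) discussed in the literature (Ahlswede and
Winter (2002), Gross (2009), Recht (2009)), we follow the standard derivation of the classical
Bernstein's inequality and we use the well known Golden-Thompson inequality (see, e.g.,
Simon (1979), p. 94): for arbitrary Hermitian matrices $A, B \in M_m(\mathbb{C})$, $\mathrm{tr}(e^{A+B}) \le
\mathrm{tr}(e^A e^B)$. Let $Y_n := X_1 + \dots + X_n$. Note that $\|Y_n\| < t$ if and only if $-t I_m < Y_n < t I_m$.
Therefore,

$$\mathbb{P}\{\|Y_n\| \ge t\} \le \mathbb{P}\{Y_n \not< t I_m\} + \mathbb{P}\{Y_n \not> -t I_m\}. \qquad (3.4)$$

The following bounds are straightforward by simple matrix algebra: for $\lambda > 0$,

$$\mathbb{P}\{Y_n \not< t I_m\} = \mathbb{P}\{e^{\lambda Y_n} \not< e^{\lambda t I_m}\} \le \mathbb{P}\bigl\{\mathrm{tr}(e^{\lambda Y_n}) \ge e^{\lambda t}\bigr\} \le e^{-\lambda t}\, \mathbb{E}\, \mathrm{tr}(e^{\lambda Y_n}). \qquad (3.5)$$

To bound the expected value in the right hand side, we use independence of random
variables $X_1, \dots, X_n$ and the Golden-Thompson inequality:

$$\mathbb{E}\, \mathrm{tr}(e^{\lambda Y_n}) = \mathbb{E}\, \mathrm{tr}(e^{\lambda Y_{n-1} + \lambda X_n}) \le \mathbb{E}\, \mathrm{tr}(e^{\lambda Y_{n-1}} e^{\lambda X_n}) = \mathrm{tr}\bigl(\mathbb{E}(e^{\lambda Y_{n-1}} e^{\lambda X_n})\bigr)$$

$$= \mathrm{tr}\bigl(\mathbb{E} e^{\lambda Y_{n-1}}\, \mathbb{E} e^{\lambda X_n}\bigr) \le \mathbb{E}\, \mathrm{tr}(e^{\lambda Y_{n-1}})\, \bigl\|\mathbb{E} e^{\lambda X}\bigr\|.$$

By induction, we conclude that

$$\mathbb{E}\, \mathrm{tr}(e^{\lambda Y_n}) \le \mathbb{E}\, \mathrm{tr}(e^{\lambda X_1})\, \bigl\|\mathbb{E} e^{\lambda X}\bigr\|^{n-1}.$$

Since $\mathbb{E}\, \mathrm{tr}(e^{\lambda X_1}) = \mathrm{tr}(\mathbb{E} e^{\lambda X_1}) \le m \bigl\|\mathbb{E} e^{\lambda X}\bigr\|$, we get

$$\mathbb{E}\, \mathrm{tr}(e^{\lambda Y_n}) \le m \bigl\|\mathbb{E} e^{\lambda X}\bigr\|^{n}. \qquad (3.6)$$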
It remains to bound the norm $\|\mathbb{E} e^{\lambda X}\|$. To this end, we use Taylor expansion and the
condition $\mathbb{E} X = 0$ to get



Ee AX =I m + EX 2 X 2 



1_ AX X 2 X 2 
2! + ~3l~ + 4! 



< 



J m + A 2 EX 2 



2! 3! 
Therefore, for all r > 0, 



1 AIIXII A 2 ||X|| 2 



Ee 



AX 



1 + A 5 



EA~ 



4! 

< 1 + A 2 
1 - At" 



I m + A 2 EX 2 



e A||X|| _i_ A || X | 

A 2 ||Al 2 



EX' 



\ 2 T 2 



+ A 2 E||X| 



; A||X|| _i-A|pr| 
A 2 ||X|| 2 

e Hx\\-i-X\\ X 



< 



A 2 ||Al 2 



I(\\X\\>r). 



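Let $M := 2(\log 2)^{1/\alpha} U_X^{(\alpha)}$ and assume that $\lambda \le 1/M$. Then

$$\mathbb{E} \|X\|^2 \left[ \frac{e^{\lambda \|X\|} - 1 - \lambda \|X\|}{\lambda^2 \|X\|^2} \right] I(\|X\| \ge \tau) \le M^2\, \mathbb{E} e^{\|X\|/M} I(\|X\| \ge \tau) \le M^2\, \mathbb{E}^{1/2} e^{2\|X\|/M}\, \mathbb{P}^{1/2}\{\|X\| \ge \tau\}.$$

Since, for $\alpha \ge 1$, $M = 2(\log 2)^{1/\alpha} \bigl\| \|X\| \bigr\|_{\psi_\alpha} \ge 2 \bigl\| \|X\| \bigr\|_{\psi_1}$ (see van der Vaart and Wellner (1996), p. 95), we have $\mathbb{E} e^{2\|X\|/M} \le 2$ and also

$$\mathbb{P}\{\|X\| \ge \tau\} \le \exp\left\{ -2^{\alpha} \log 2 \left( \frac{\tau}{M} \right)^{\alpha} \right\}.$$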



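As a result, we get the following bound

$$\|\mathbb{E} e^{\lambda X}\| \le 1 + \lambda^2 \sigma_X^2 \left[ \frac{e^{\lambda \tau} - 1 - \lambda \tau}{\lambda^2 \tau^2} \right] + 2^{1/2} \lambda^2 M^2 \exp\left\{ -2^{\alpha - 1} \log 2 \left( \frac{\tau}{M} \right)^{\alpha} \right\}.$$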



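Let $\tau := M\, \frac{2^{1/\alpha - 1}}{(\log 2)^{1/\alpha}} \log^{1/\alpha} \frac{M^2}{\sigma_X^2}$ and suppose that $\lambda$ satisfies the condition $\lambda \tau \le 1$. Then, the following bound holds with some constant $C_1 > 0$:

$$\|\mathbb{E} e^{\lambda X}\| \le 1 + C_1 \lambda^2 \sigma_X^2 \le \exp\{C_1 \lambda^2 \sigma_X^2\}.$$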



Thus, we proved that there exist constants $C_1, C_2 > 0$ such that, for all $\lambda$ satisfying the condition

$$\lambda\, U_X^{(\alpha)} \left( \log \frac{U_X^{(\alpha)}}{\sigma_X} \right)^{1/\alpha} \le C_2, \tag{3.7}$$
we have $\|\mathbb{E} e^{\lambda X}\| \le \exp\{C_1 \lambda^2 \sigma_X^2\}$. This can be combined with (3.4), (3.5) and (3.6) to get

$$\mathbb{P}\{\|Y_n\| \ge t\} \le 2m \exp\{-\lambda t + C_1 \lambda^2 n \sigma_X^2\}.$$



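It remains now to minimize the last bound with respect to all $\lambda$ satisfying (3.7) to get that, for some constant $K > 0$,

$$\mathbb{P}\{\|Y_n\| \ge t\} \le 2m \exp\left\{ -\frac{1}{K}\, \frac{t^2}{n \sigma_X^2 + t\, U_X^{(\alpha)} \log^{1/\alpha}\bigl(U_X^{(\alpha)}/\sigma_X\bigr)} \right\},$$

which immediately implies (3.3).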
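The minimization over $\lambda$ can be sketched as follows. The shorthand $\lambda_{\max}$, $\lambda_*$ and the explicit value of $K$ are introduced here only for illustration and are not taken from the source. Let $\lambda_{\max} := C_2 / \bigl(U_X^{(\alpha)} \log^{1/\alpha}(U_X^{(\alpha)}/\sigma_X)\bigr)$ be the largest value allowed by (3.7), and let $\lambda_* := t/(2 C_1 n \sigma_X^2)$ be the unconstrained minimizer of the exponent $-\lambda t + C_1 \lambda^2 n \sigma_X^2$.

```latex
\begin{aligned}
\lambda_* \le \lambda_{\max}: &\quad
  -\lambda_* t + C_1 \lambda_*^2 n \sigma_X^2 = -\frac{t^2}{4 C_1 n \sigma_X^2}, \\
\lambda_* > \lambda_{\max}: &\quad
  -\lambda_{\max} t + C_1 \lambda_{\max}^2 n \sigma_X^2
  \le -\lambda_{\max} t + \lambda_{\max} \lambda_* C_1 n \sigma_X^2
  = -\frac{\lambda_{\max} t}{2}
  = -\frac{C_2\, t}{2\, U_X^{(\alpha)} \log^{1/\alpha}(U_X^{(\alpha)}/\sigma_X)}.
\end{aligned}
```

Since $\min(t^2/b,\, t^2/c) \ge t^2/(b + c)$ for $b, c > 0$, in both cases the exponent is at most $-\frac{1}{K}\, t^2 / \bigl(n \sigma_X^2 + t\, U_X^{(\alpha)} \log^{1/\alpha}(U_X^{(\alpha)}/\sigma_X)\bigr)$ with, for instance, $K = \max(4 C_1,\, 2/C_2)$.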



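Note that, in a standard way, one can deduce bounds on the expectation from the exponential bounds on tail probabilities. In particular, (3.1) implies that

$$\mathbb{E} \left\| \frac{X_1 + \cdots + X_n}{n} \right\| \le C \left( \sigma_X \sqrt{\frac{\log(2m)}{n}} \vee U\, \frac{\log(2m)}{n} \right). \tag{3.8}$$

Similarly, Proposition 2 implies that

$$\mathbb{E} \left\| \frac{X_1 + \cdots + X_n}{n} \right\| \le C \left( \sigma_X \sqrt{\frac{\log(2m)}{n}} \vee U_X^{(\alpha)} \left( \log \frac{U_X^{(\alpha)}}{\sigma_X} \right)^{1/\alpha} \frac{\log(2m)}{n} \right). \tag{3.9}$$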



Combining the last bounds with Talagrand's concentration inequality leads to somewhat different versions of bounds (3.2) and (3.3) that can be better in some applications. Namely, denote

$$\tilde{\sigma}_X^2 := \sup_{u, v \in \mathbb{C}^m,\ |u| \le 1,\ |v| \le 1} \mathbb{E} |\langle X u, v \rangle|^2.$$

It is easy to check that $\tilde{\sigma}_X^2 \le \sigma_X^2$. Moreover, in some cases, it can be significantly smaller (for instance, if $X$ is sampled at random from the matrix completion basis, then $\sigma_X^2$ is of the order $m^{-1}$ and $\tilde{\sigma}_X^2$ is equal to $m^{-2}$). The expectation bound (3.8) and Talagrand's concentration inequality imply that with probability at least $1 - e^{-t}$



X\ + • • • + X n 



n 



<C[a x 



log(2m) 



n 



Similarly, combining the expectation bound (3.9) for $\alpha = 1$ with Adamczak's version of Talagrand's inequality (see Section 3.2), we get that with probability at least $1 - e^{-t}$



X x + ---+X n 



n 



< Ciax 



log(2m' 



U§) \ log(2m) 



n 



n 

(3.11) 
In the examples when $\tilde{\sigma}_X$ is significantly smaller than $\sigma_X$, these bounds might be better than (3.2) and (3.3), especially when they are used for large values of $t$.

In principle, using bounds (3.10) and (3.11) in the proofs of the following sections instead of (3.2) and (3.3) provides a way to obtain probabilistic oracle inequalities with probabilities of the error decreasing exponentially with $m$ or $n$ (this is the way in which error bounds are written in the papers by Candes and Plan (2009) and Rohde and Tsybakov (2009)). We are not pursuing this approach here.

4 Approximation Error 

A natural first step in the analysis of the problem is to study its version with the true risk instead of the empirical risk. The true risk with respect to the quadratic loss is equal to

$$\mathbb{E}(Y - \langle S, X \rangle)^2 = \mathbb{E}(\langle \rho, X \rangle + \xi - \langle S, X \rangle)^2 = \mathbb{E}\langle S - \rho, X \rangle^2 + \mathbb{E}\xi^2,$$

where we used the assumptions that $X$ and $\xi$ are independent and $\mathbb{E}\xi = 0$. Thus, the penalized true risk minimization problem becomes

$$\rho^{\varepsilon} := \mathrm{argmin}_{S \in \mathcal{S}} \left[ \mathbb{E}\langle S - \rho, X \rangle^2 + \varepsilon\, \mathrm{tr}(S \log S) \right], \tag{4.1}$$

and the goal is to study the error of approximation of $\rho$ by $\rho^{\varepsilon}$ depending on the value of regularization parameter $\varepsilon > 0$. The next propositions show that if there exists an oracle $S \in \mathcal{S}$ that provides a good approximation of the true state $\rho$ in a sense that $\|S - \rho\|^2_{L_2(\Pi)}$ is small, then $\rho^{\varepsilon}$ belongs to an $L_2(\Pi)$-ball around $S$ of small enough radius that can be controlled in terms of the operator norm $\|\log S\|$ or in terms of more subtle characteristics of the oracle $S$. They also provide upper bounds on the approximation error $\|\rho^{\varepsilon} - \rho\|^2_{L_2(\Pi)}$.

We will first obtain a simple bound on $\|\rho^{\varepsilon} - S\|^2_{L_2(\Pi)}$ for an arbitrary oracle $S \in \mathcal{S}$ of full rank expressed in terms of the operator norm $\|\log S\|$ of its logarithm. For simplicity, we assume that $\|\log S\| = +\infty$ in the case when $\mathrm{rank}(S) < m$ (and $\log S$ is not defined). Note, however, that $\mathrm{tr}(S \log S)$ is well defined and finite even in the case when $\mathrm{rank}(S) < m$.

Proposition 3 For all $S \in \mathcal{S}$, $\|\rho^{\varepsilon} - S\|_{L_2(\Pi)} \le \|S - \rho\|_{L_2(\Pi)} + \sqrt{\varepsilon \|\log S\|}$. This implies that

$$\|\rho^{\varepsilon} - \rho\|_{L_2(\Pi)} \le 2\|S - \rho\|_{L_2(\Pi)} + \sqrt{\varepsilon \|\log S\|}$$

and, for $S = \rho$, $\|\rho^{\varepsilon} - \rho\|^2_{L_2(\Pi)} \le \varepsilon \|\log \rho\|$.
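The remark that $\mathrm{tr}(S \log S)$ stays finite for rank-deficient $S$ can be illustrated with a minimal sketch. It assumes the density matrix is given through its eigenvalues and uses the usual convention $0 \log 0 = 0$; the function name and the tolerance are hypothetical choices made for this example.

```python
import math


def entropy_penalty(eigenvalues, tol=1e-12):
    """Return tr(S log S) for a density matrix S with the given eigenvalues.

    Zero eigenvalues contribute nothing (convention 0 * log 0 = 0), so the
    value is finite even when S is not of full rank.
    """
    total = sum(eigenvalues)
    if any(g < -tol for g in eigenvalues) or abs(total - 1.0) > 1e-9:
        raise ValueError("eigenvalues must be nonnegative and sum to 1")
    return sum(g * math.log(g) for g in eigenvalues if g > tol)
```

For a pure state (a single eigenvalue equal to 1) the penalty is 0, while for the maximally mixed state $I_m/m$ it equals $-\log m$, its smallest possible value.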
For a differentiable mapping $g$ from an open subset $G \subset M_m(\mathbb{C})$ into $M_m(\mathbb{C})$, denote by $Dg(A; H)$ its differential at a matrix $A \in G$ in the direction $H \in M_m(\mathbb{C})$, that is, $Dg(A; H)$ is linear with respect to $H$ and

$$g(A + H) = g(A) + Dg(A; H) + o(\|H\|) \quad \text{as } \|H\| \to 0.$$

The following lemma is a simple corollary of Theorem V.3.3 in Bhatia (1996):

Lemma 1 Let $f$ be a function continuously differentiable in an open interval $I \subset \mathbb{R}$. Suppose that $A$ is a Hermitian matrix whose spectrum belongs to $I$. Then the mapping $B \mapsto g(B) := \mathrm{tr}(f(B))$ is differentiable at $A$ and $Dg(A; H) = \mathrm{tr}(f'(A) H)$.

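Proof of Proposition 3. Denote the penalized risk

$$L(S) := \mathbb{E}\langle S - \rho, X \rangle^2 + \varepsilon\, \mathrm{tr}(S \log S).$$

It is easy to see that the solution $\rho^{\varepsilon}$ of problem (4.1) is a full rank matrix. To prove this, assume that $\mathrm{rank}(\rho^{\varepsilon}) < m$. Let $\tilde{\rho} := (1 - \delta)\rho^{\varepsilon} + \delta I_m/m$, where $I_m$ is the $m \times m$ identity matrix. Then, for small enough $\delta$, $\tilde{\rho}$ is a full rank matrix and it is straightforward to show that the penalized risk $L(\tilde{\rho})$ is strictly smaller than $L(\rho^{\varepsilon})$ (for some small $\delta > 0$). It is also easy to check that, for any $S \in \mathcal{S}$ of full rank, $\log S$ is well defined and the differential of the functional $L$ in the direction $\nu \in M_m(\mathbb{C})$ is equal to

$$DL(S; \nu) = 2\mathbb{E}\langle S - \rho, X \rangle \langle \nu, X \rangle + \varepsilon\, \mathrm{tr}(\nu \log S).$$

This follows from the fact that the first term of the functional $L$ is differentiable since it is quadratic. The differentiability of the penalty term is based on Lemma 1 (it is enough to apply this lemma to the function $f(u) = u \log u$). Since $\rho^{\varepsilon}$ is the minimal point of $L$ in $\mathcal{S}$, we can conclude that, for an arbitrary $S \in \mathcal{S}$, $DL(\rho^{\varepsilon}; S - \rho^{\varepsilon}) \ge 0$. This implies that

$$DL(S; S - \rho^{\varepsilon}) - DL(\rho^{\varepsilon}; S - \rho^{\varepsilon}) \le DL(S; S - \rho^{\varepsilon}),$$

which, by a simple algebra, becomes

$$2\|S - \rho^{\varepsilon}\|^2_{L_2(\Pi)} + \varepsilon K(S; \rho^{\varepsilon}) \le 2\mathbb{E}\langle S - \rho, X \rangle \langle S - \rho^{\varepsilon}, X \rangle + \varepsilon \langle S - \rho^{\varepsilon}, \log S \rangle. \tag{4.2}$$

To conclude the proof, note that (4.2), the bound $\|S - \rho^{\varepsilon}\|_1 \le 2$ and Cauchy-Schwarz inequality imply that

$$2\|S - \rho^{\varepsilon}\|^2_{L_2(\Pi)} + \varepsilon K(S; \rho^{\varepsilon}) \le 2\|S - \rho\|_{L_2(\Pi)} \|S - \rho^{\varepsilon}\|_{L_2(\Pi)} + 2\varepsilon \|\log S\|.$$

Solving the last inequality with respect to $\|\rho^{\varepsilon} - S\|_{L_2(\Pi)}$ and using the fact that $K(S; \rho^{\varepsilon}) \ge 0$, yields the bound

$$\|\rho^{\varepsilon} - S\|_{L_2(\Pi)} \le \frac{\|S - \rho\|_{L_2(\Pi)}}{2} + \sqrt{\frac{\|S - \rho\|^2_{L_2(\Pi)}}{4} + \varepsilon \|\log S\|},$$

which implies $\|\rho^{\varepsilon} - S\|_{L_2(\Pi)} \le \|S - \rho\|_{L_2(\Pi)} + \sqrt{\varepsilon \|\log S\|}$, and the result follows.

□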

To obtain more subtle bounds with approximation error of the order $O(\varepsilon^2)$ instead of $O(\varepsilon)$, we introduce and use the following quantity

$$a(W) := a_{\Pi}(W) := a_X(W) := \sup\bigl\{ \langle W, U \rangle : U \in M_m(\mathbb{C}),\ U = U^*,\ \mathrm{tr}(U) = 0,\ \|U\|_{L_2(\Pi)} = 1 \bigr\},$$

which will be called the alignment coefficient of $W$. Similar quantities were used in the commutative case (Koltchinskii (2009)). Note that, for all constants $c$,

$$a(W + cI_m) = a(W) \tag{4.3}$$

(since $\langle I_m, U \rangle = 0$ for all $U$ of zero trace). In addition, we have

$$a_{cX}(W) = \frac{1}{|c|}\, a_X(W), \quad c \ne 0. \tag{4.4}$$

Let $\{E_i : i = 1, \dots, m^2\}$ be an orthonormal basis of $M_m(\mathbb{C})$ consisting of Hermitian matrices and let $\mathcal{K} := \bigl( \langle E_j, E_k \rangle_{L_2(\Pi)} \bigr)_{j,k=1}^{m^2}$ be the Gram matrix of the functions $\{\langle E_j, \cdot \rangle : j = 1, \dots, m^2\}$ in the space $L_2(\Pi)$. Clearly, the mapping $J : M_m(\mathbb{C}) \mapsto \ell_2^{m^2}(\mathbb{C})$,

$$JU = (\langle U, E_j \rangle : j = 1, \dots, m^2), \quad U \in M_m(\mathbb{C}),$$

is an isometry. If now we define $K : M_m(\mathbb{C}) \mapsto M_m(\mathbb{C})$ as $K := J^{-1} \mathcal{K} J$, then we also have $K^{1/2} = J^{-1} \mathcal{K}^{1/2} J$, $K^{-1/2} = J^{-1} \mathcal{K}^{-1/2} J$. As a consequence, for any matrix $U = \sum_{j=1}^{m^2} u_j E_j$,

$$\|U\|^2_{L_2(\Pi)} = \sum_{j,k=1}^{m^2} \langle E_j, E_k \rangle_{L_2(\Pi)}\, u_j \bar{u}_k = \langle \mathcal{K} u, u \rangle_{\ell_2} = \|\mathcal{K}^{1/2} u\|^2_{\ell_2} = \|K^{1/2} U\|^2_2$$

and it is not hard to conclude that $a(W) \le \|K^{-1/2} W\|_2$. Moreover, in view of (4.3), for an arbitrary scalar $c$,

$$a(W) \le \|K^{-1/2}(W + cI_m)\|_2.$$
This shows that the size of $a(W)$ depends on how $W$ is "aligned" with the eigenspaces of the Gram matrix $\mathcal{K}$. In a special case when, for all $A$, $\|A\|_{L_2(\Pi)} = \|A\|_2$, the functions $\{\langle E_j, \cdot \rangle : j = 1, \dots, m^2\}$ form an orthonormal system in the space $L_2(\Pi)$ and the Gram matrix $\mathcal{K}$ is the identity matrix. In this case, we simply have the bound

$$a(W) \le \inf_{c} \|W + cI_m\|_2.$$
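In the special case just described, the upper bound $\inf_c \|W + cI_m\|_2$ has a closed form: the Hilbert-Schmidt norm is minimized at $c = -\mathrm{tr}(W)/m$, that is, by removing the trace part of $W$. The following minimal sketch computes it for a matrix stored as nested lists; the function name is hypothetical and the code only evaluates this upper bound, not $a(W)$ for a general design distribution.

```python
import math


def centered_hs_norm(W):
    """Return (c_star, value), where value = min over c of ||W + c I||_2.

    ||.||_2 is the Hilbert-Schmidt norm; the minimum is attained at
    c_star = -tr(W) / m because only the diagonal entries depend on c.
    """
    m = len(W)
    trace = sum(W[i][i] for i in range(m))
    c_star = -trace / m
    squared = sum(
        abs(W[i][j] + (c_star if i == j else 0.0)) ** 2
        for i in range(m)
        for j in range(m)
    )
    return c_star, math.sqrt(squared)
```

For example, for $W = \mathrm{diag}(1, 3)$ the minimizer is $c = -2$ and the bound equals $\sqrt{2}$, compared with $\|W\|_2 = \sqrt{10}$ without centering.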

In the next statement, we use the alignment coefficient $a(\log S)$ to control the $L_2(\Pi)$-distance $\|\rho^{\varepsilon} - S\|_{L_2(\Pi)}$ and the Kullback-Leibler "distance" $K(\rho^{\varepsilon}; S)$ from the true solution $\rho^{\varepsilon}$ to an arbitrary oracle $S$.

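Proposition 4 For all $S \in \mathcal{S}$,

$$\|\rho^{\varepsilon} - S\|^2_{L_2(\Pi)} + \frac{\varepsilon}{2} K(\rho^{\varepsilon}; S) \le \left( \|S - \rho\|_{L_2(\Pi)} + \frac{\varepsilon}{2}\, a(\log S) \right)^2.$$

In particular, it implies that $\|\rho^{\varepsilon} - \rho\|^2_{L_2(\Pi)} + \frac{\varepsilon}{2} K(\rho^{\varepsilon}; \rho) \le \frac{\varepsilon^2}{4}\, a^2(\log \rho)$. Moreover, the following bound also holds:

$$\|\rho^{\varepsilon} - \rho\|^2_{L_2(\Pi)} \le \inf_{S \in \mathcal{S}} \left[ \|S - \rho\|^2_{L_2(\Pi)} + \varepsilon\, a(\log S) \|S - \rho\|_{L_2(\Pi)} + \frac{\varepsilon^2}{2}\, a^2(\log S) \right].$$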



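Proof. Our starting point is the relationship (4.2) from the proof of Proposition 3. It follows from the definition of $a(W)$, from (4.2) and from Cauchy-Schwarz inequality that

$$2\|S - \rho^{\varepsilon}\|^2_{L_2(\Pi)} + \varepsilon K(S; \rho^{\varepsilon}) \le 2\|S - \rho\|_{L_2(\Pi)} \|S - \rho^{\varepsilon}\|_{L_2(\Pi)} + \varepsilon\, a(\log S) \|S - \rho^{\varepsilon}\|_{L_2(\Pi)}.$$

It remains to solve the last inequality for $\|S - \rho^{\varepsilon}\|_{L_2(\Pi)}$ to obtain the first bound of the proposition. The second bound is its special case with $S = \rho$. To prove the third bound note that, by the definition of $\rho^{\varepsilon}$, for all $S \in \mathcal{S}$,

$$\|\rho^{\varepsilon} - \rho\|^2_{L_2(\Pi)} + \varepsilon\, \mathrm{tr}(\rho^{\varepsilon} \log \rho^{\varepsilon}) \le \|S - \rho\|^2_{L_2(\Pi)} + \varepsilon\, \mathrm{tr}(S \log S),$$

which implies

$$\|\rho^{\varepsilon} - \rho\|^2_{L_2(\Pi)} \le \|S - \rho\|^2_{L_2(\Pi)} + \varepsilon \bigl( \mathrm{tr}(S \log S) - \mathrm{tr}(\rho^{\varepsilon} \log \rho^{\varepsilon}) \bigr) \le \|S - \rho\|^2_{L_2(\Pi)} + \varepsilon\, \mathrm{tr}(\log S\, (S - \rho^{\varepsilon})) \le \|S - \rho\|^2_{L_2(\Pi)} + \varepsilon\, a(\log S) \|\rho^{\varepsilon} - S\|_{L_2(\Pi)},$$

where we used the fact that, by convexity of the function $S \mapsto \mathrm{tr}(S \log S)$,

$$\mathrm{tr}(S \log S) - \mathrm{tr}(\rho^{\varepsilon} \log \rho^{\varepsilon}) \le \mathrm{tr}(\log S\, (S - \rho^{\varepsilon})).$$

It remains to bound $\|\rho^{\varepsilon} - S\|_{L_2(\Pi)}$ from above using the first inequality of the proposition.

A consequence of propositions 3 and 4 is that

$$\|\rho^{\varepsilon} - \rho\|^2_{L_2(\Pi)} \le \frac{\varepsilon^2}{4}\, a^2(\log \rho) \wedge \varepsilon \|\log \rho\|. \tag{4.5}$$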



We will now provide versions of approximation error bounds for special types of oracles $S \in \mathcal{S}$.

Low Rank Oracles. First we show how to adapt the bounds of Proposition 4 expressed in terms of the alignment coefficient $a(\log S)$ for a full rank matrix $S$ (for which $\log S$ is well defined) to the case when $S$ is an oracle of a small rank $r < m$. For a subspace $L$ of $\mathbb{C}^m$, denote $\Lambda(L) := \sup_{\|A\|_{L_2(\Pi)} \le 1} \|P_L A P_L\|_2$. Suppose that $S \in \mathcal{S}$ is a matrix of rank $r$. To be specific, let $S = \sum_{j=1}^{r} \gamma_j (e_j \otimes e_j)$, where $\gamma_j$ are positive eigenvalues of $S$ and $\{e_1, \dots, e_m\}$ is an orthonormal basis of $\mathbb{C}^m$. Let $L$ be the linear span of the vectors $e_1, \dots, e_r$.



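Proposition 5 There exists a numerical constant $C > 0$ such that, for all $\varepsilon > 0$,

$$\|\rho^{\varepsilon} - \rho\|^2_{L_2(\Pi)} \le 2\|S - \rho\|^2_{L_2(\Pi)} + C\varepsilon^2 \left[ \Lambda^2(L)\, r \log^2\left( 1 + \frac{m}{\varepsilon \wedge 1} \right) + \mathbb{E}\|X\|^2 \right].$$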



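Proof. Note that, for all matrices $W$ of rank $r$ "supported" in the space $L$ in the sense that $W = P_L W P_L$, we have

$$a(W) \le \sup_{\|U\|_{L_2(\Pi)} \le 1} \langle W, U \rangle = \sup_{\|U\|_{L_2(\Pi)} \le 1} \langle W, P_L U P_L \rangle \le \Lambda(L) \|W\|_2.$$

For $\delta \in (0, 1)$, consider $S_{\delta} := (1 - \delta) S + \delta \frac{I_m}{m}$. Then, using the fact that $a(W + cI_m) = a(W)$, we get

$$\log S_{\delta} = \sum_{j=1}^{r} \bigl( \log((1 - \delta)\gamma_j + \delta/m) - \log(\delta/m) \bigr) (e_j \otimes e_j) + \log(\delta/m)\, I_m$$

and

$$a(\log S_{\delta}) = a\left( \sum_{j=1}^{r} \bigl( \log((1 - \delta)\gamma_j + \delta/m) - \log(\delta/m) \bigr) (e_j \otimes e_j) \right) \le \Lambda(L) \left\| \sum_{j=1}^{r} \bigl( \log((1 - \delta)\gamma_j + \delta/m) - \log(\delta/m) \bigr) (e_j \otimes e_j) \right\|_2 \le \Lambda(L) \left( \sum_{j=1}^{r} \log^2\left( 1 + \frac{m\gamma_j}{\delta} \right) \right)^{1/2} \le \Lambda(L) \sqrt{r} \log\left( 1 + \frac{m\|S\|}{\delta} \right) \le \Lambda(L) \sqrt{r} \log\left( 1 + \frac{m}{\delta} \right).$$

Note also that $\|S - S_{\delta}\|^2_{L_2(\Pi)} = \delta^2 \|S - I_m/m\|^2_{L_2(\Pi)} \le 4\delta^2\, \mathbb{E}\|X\|^2$, since

$$\|S - I_m/m\|^2_{L_2(\Pi)} \le 2\bigl( \mathbb{E}\langle S, X \rangle^2 + \mathbb{E}\langle I_m/m, X \rangle^2 \bigr) \le 2\bigl( \|S\|_1^2\, \mathbb{E}\|X\|^2 + \|I_m/m\|_1^2\, \mathbb{E}\|X\|^2 \bigr) \le 4\, \mathbb{E}\|X\|^2.$$

Thus, it easily follows from Proposition 4 that

$$\|\rho^{\varepsilon} - \rho\|^2_{L_2(\Pi)} \le \frac{3}{2} \|S_{\delta} - \rho\|^2_{L_2(\Pi)} + \varepsilon^2 a^2(\log S_{\delta}) \le \frac{3}{2} \bigl( \|S - \rho\|_{L_2(\Pi)} + \|S_{\delta} - S\|_{L_2(\Pi)} \bigr)^2 + \Lambda^2(L)\, r \varepsilon^2 \log^2\left( 1 + \frac{m}{\delta} \right)$$

$$\le \frac{3}{2} \left( \frac{4}{3} \|S - \rho\|^2_{L_2(\Pi)} + 4 \|S_{\delta} - S\|^2_{L_2(\Pi)} \right) + \Lambda^2(L)\, r \varepsilon^2 \log^2\left( 1 + \frac{m}{\delta} \right) \le 2\|S - \rho\|^2_{L_2(\Pi)} + 24\, \mathbb{E}\|X\|^2 \delta^2 + \Lambda^2(L)\, r \varepsilon^2 \log^2\left( 1 + \frac{m}{\delta} \right).$$

Taking $\delta = \varepsilon \wedge 1$, this yields

$$\|\rho^{\varepsilon} - \rho\|^2_{L_2(\Pi)} \le 2\|S - \rho\|^2_{L_2(\Pi)} + C\varepsilon^2 \left[ \Lambda^2(L)\, r \log^2\left( 1 + \frac{m}{\varepsilon \wedge 1} \right) + \mathbb{E}\|X\|^2 \right]$$

with a numerical constant $C > 0$.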



Remark. The bound of Proposition [5] can be also written in the following form that 
might be preferable when E||X|| 2 is large: 



\\p £ -p\\l 2{ u ) <2\\S-p\\ 2 L2{n) + Ce 2 



A (L)r log ( 1 + 
A 1. 



mE^WXW' 



Vm + 1 



In the proof, it is enough to take 5 := El / 2 ^ x ^ 2 

Note that if $\{E_i, i = 1, \dots, m^2\}$ is an orthonormal basis of $M_m(\mathbb{C})$ consisting of Hermitian matrices and $X$ is uniformly distributed in $\{E_i, i = 1, \dots, m^2\}$, then for all Hermitian $A$

$$\|A\|^2_{L_2(\Pi)} = \mathbb{E}\langle A, X \rangle^2 = m^{-2} \sum_{i=1}^{m^2} \langle A, E_i \rangle^2 = m^{-2} \|A\|_2^2.$$

Therefore $\Lambda(L) \le \sup_{\|A\|_{L_2(\Pi)} \le 1} \|A\|_2 = \sup_{\|A\|_2 \le m} \|A\|_2 = m$. Also, in this case $\|X\| \le \|X\|_2 = 1$. Thus, Proposition 5 yields

$$\|\rho^{\varepsilon} - \rho\|^2_{L_2(\Pi)} \le 2\|S - \rho\|^2_{L_2(\Pi)} + C m^2 r \varepsilon^2 \log^2\left( 1 + \frac{m}{\varepsilon \wedge 1} \right) + C\varepsilon^2.$$

Gibbs Oracles. Let $H$ be a Hermitian matrix ("a Hamiltonian") and let $\beta > 0$. Consider the following density matrix (a "Gibbs oracle"):

$$\rho_{H,\beta} := \frac{e^{-\beta H}}{\mathrm{tr}(e^{-\beta H})}.$$

For simplicity, assume in what follows that $\beta = 1$ (in fact, one can always replace $H$ by $\beta H$) and denote $\rho_H := \frac{e^{-H}}{\mathrm{tr}(e^{-H})}$. Let $\gamma_1 \le \gamma_2 \le \cdots \le \gamma_m$ be the eigenvalues of $H$ and $e_1, \dots, e_m$ be the corresponding eigenvectors. Let $L_r = \mathrm{l.s.}(\{e_1, \dots, e_r\})$ and $H_{\le r} := \sum_{j=1}^{r} \gamma_j (e_j \otimes e_j)$, $H_{> r} := \sum_{j=r+1}^{m} \gamma_j (e_j \otimes e_j)$. It is easy to see that

$$\|P_{L_r^{\perp}}\, \rho_H\, P_{L_r^{\perp}}\|_1 = \frac{\sum_{k=r+1}^{m} e^{-\gamma_k}}{\sum_{k=1}^{m} e^{-\gamma_k}} =: \delta_r(H).$$

Under reasonable conditions on the spectrum of $H$, the quantity $\delta_r(H)$ decreases fast enough when $r$ increases. Thus, $\rho_H$ can be well approximated by low rank matrices.
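To get a feeling for how quickly a Gibbs state concentrates on the low-energy eigenvectors, the sketch below computes the eigenvalues $e^{-\gamma_k} / \sum_j e^{-\gamma_j}$ of $\rho_H$ from the spectrum of $H$, together with the mass carried by all but the $r$ smallest energies. Identifying this tail mass with the quantity $\delta_r(H)$ of the text is an assumption of this example, and the function names are hypothetical.

```python
import math


def gibbs_weights(gammas):
    """Eigenvalues of the Gibbs state e^{-H} / tr(e^{-H}) for H with spectrum gammas."""
    shift = min(gammas)  # subtracting the minimum avoids overflow; the ratios are unchanged
    unnormalized = [math.exp(-(g - shift)) for g in gammas]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]


def tail_mass(gammas, r):
    """Total Gibbs weight outside the r smallest eigenvalues of H."""
    weights = gibbs_weights(sorted(gammas))
    return sum(weights[r:])
```

For a spectrum that grows linearly, such as $\gamma_k = k$, the tail mass decays geometrically in $r$, which is one instance of the "reasonable conditions" under which a low rank approximation is accurate.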

The next statement follows immediately from Proposition 4. Here the unknown density matrix $\rho$ is approximated by a Gibbs model with an arbitrary Hamiltonian. The error is controlled in terms of the $L_2(\Pi)$-distance between $\rho$ and the oracle $\rho_H$ and also in terms of the alignment coefficient $a(H_{\le r})$ for a "low rank part" $H_{\le r}$ of the Hamiltonian $H$ and the quantity $\delta_r(H)$.

Proposition 6 For all Hermitian nonnegatively definite matrices $H$ and for all $\varepsilon > 0$,

$$\|\rho^{\varepsilon} - \rho\|^2_{L_2(\Pi)} \le 2\|\rho_H - \rho\|^2_{L_2(\Pi)} + 24 \max_{1 \le k \le m} \mathbb{E}\langle X e_k, e_k \rangle^2\, \delta_r^2(H) + a^2(H_{\le r})\, \varepsilon^2.$$

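Proof. We will use the last bound of Proposition 4 with $S = \rho_{H_{\le r}}$. Note that

$$a(\log \rho_{H_{\le r}}) = a\bigl( -H_{\le r} - \log \mathrm{tr}(e^{-H_{\le r}})\, I_m \bigr) = a(H_{\le r}).$$

Therefore, we have

$$\|\rho^{\varepsilon} - \rho\|^2_{L_2(\Pi)} \le \|\rho_{H_{\le r}} - \rho\|^2_{L_2(\Pi)} + \varepsilon\, a(H_{\le r}) \|\rho_{H_{\le r}} - \rho\|_{L_2(\Pi)} + \frac{\varepsilon^2}{2}\, a^2(H_{\le r}) \le \frac{3}{2} \|\rho_{H_{\le r}} - \rho\|^2_{L_2(\Pi)} + \varepsilon^2 a^2(H_{\le r}).$$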
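In addition to this

$$\|\rho_H - \rho_{H_{\le r}}\|_{L_2(\Pi)} = \left\| \frac{\sum_{k=1}^{m} e^{-\gamma_k} (e_k \otimes e_k)}{\sum_{k=1}^{m} e^{-\gamma_k}} - \frac{\sum_{k=1}^{r} e^{-\gamma_k} (e_k \otimes e_k)}{\sum_{k=1}^{r} e^{-\gamma_k}} \right\|_{L_2(\Pi)},$$

which can be easily bounded from above by

$$2\delta_r(H) \max_{1 \le k \le m} \|e_k \otimes e_k\|_{L_2(\Pi)} = 2\delta_r(H) \max_{1 \le k \le m} \mathbb{E}^{1/2} \langle X e_k, e_k \rangle^2.$$

The result follows immediately (by the same argument as in the proof of Proposition 5).

□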



5 Random Error Bounds and Oracle Inequalities 

We now turn to the analysis of random error of the estimator $\hat{\rho}^{\varepsilon}$. We obtain upper bounds on the $L_2(\Pi)$ and Kullback-Leibler distances of this estimator to an arbitrary oracle $S \in \mathcal{S}$ of full rank. In particular, this includes bounding the distances between $\hat{\rho}^{\varepsilon}$ and $\rho^{\varepsilon}$. As a consequence, we will obtain oracle inequalities for the empirical solution $\hat{\rho}^{\varepsilon}$. The size of both errors $\|\hat{\rho}^{\varepsilon} - S\|^2_{L_2(\Pi)}$ and $K(\hat{\rho}^{\varepsilon}; S)$ will be controlled in terms of the squared $L_2(\Pi)$-distance $\|S - \rho\|^2_{L_2(\Pi)}$ from the oracle to the target density matrix $\rho$ and also in terms of such characteristics of the oracle as the norm $\|\log S\|$ or the alignment coefficient $a(\log S)$ that have been already used in the approximation error bounds of the previous section (see propositions 3, 4). However, in the case of the random error, we also need some additional quantities that describe the properties of the design distribution $\Pi$ and of the noise $\xi$. These quantities are explicitly involved in the statements of the results below, which makes these statements somewhat complicated. At the same time, it is easy to control these quantities in concrete examples and to derive in special cases the bounds that are easier to understand.

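Assumptions on the design distribution $\Pi$. In this section, it will be assumed that $X$ is a random Hermitian $m \times m$ matrix and that, for some constant $U > 0$, $\|X\| \le U$. We will denote

$$\sigma_X^2 := \|\mathbb{E}(X - \mathbb{E}X)^2\|, \qquad \sigma_{X \otimes X}^2 := \|\mathbb{E}(X \otimes X - \mathbb{E}(X \otimes X))^2\|.$$

Let $L \subset \mathbb{C}^m$ be a subspace of dimension $r \le m$ and let $\mathcal{P}_L : M_m(\mathbb{C}) \mapsto M_m(\mathbb{C})$, $\mathcal{P}_L x := x - P_{L^{\perp}} x P_{L^{\perp}}$. We will use the following quantity:

$$\beta(L) := \sup_{\|A\|_{L_2(\Pi)} \le 1} \|\mathcal{P}_L A\|_{L_2(\Pi)}.$$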
IWU 2 (n)<l 
Note that $\|\mathcal{P}_L A\|_2 \le \|A\|_2$ (for a proof, choose a basis $\{e_1, \dots, e_m\}$ of $\mathbb{C}^m$ such that $L = \mathrm{l.s.}(e_1, \dots, e_r)$ and represent the linear transformations in this basis). If, for all $A$, $K_1 \|A\|_2 \le \|A\|_{L_2(\Pi)} \le K_2 \|A\|_2$, then $\beta(L) \le K_2/K_1$. In particular, if $K_1 = K_2$, then $\beta(L) = 1$ (which is the case, for instance, when $X$ is sampled at random from an orthonormal basis).

Assumptions on the noise $\xi$. Recall that $\mathbb{E}\xi = 0$ and let $\sigma_{\xi}^2 := \mathbb{E}\xi^2 < +\infty$. We will further assume that the noise is uniformly bounded by a constant $c_{\xi} > 0$: $|\xi| \le c_{\xi}$, and the proofs of the results of this section will be given under this assumption. Alternatively, one can assume that the noise is not necessarily uniformly bounded, but $\|\xi\|_{\psi_1} < +\infty$. This includes, for instance, the case of Gaussian noise. For such an unbounded noise, one should replace in the proofs of theorems 4, 5 and 6 below the noncommutative Bernstein inequality of Ahlswede and Winter by the bound of Proposition 2. One should also use a version of concentration inequality for empirical processes by Adamczak (2008) instead of the usual version of Talagrand for bounded function classes (see Section 3).

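Given $t > 0$, denote $t_m := t + \log(2m)$, $\tau_n := t + \log\log_2(2n)$ and

$$\varepsilon_{n,m} := \bigl( \sigma_{\xi}\sigma_X \vee \sigma_{\xi}\|\mathbb{E}X\| \vee \sigma_{X \otimes X} \bigr) \sqrt{\frac{t_m}{n}} \vee (c_{\xi} U \vee U^2)\, \frac{t_m}{n}.$$

We will start with a simple result in spirit of approximation error bound of Proposition 3.

Theorem 4 There exists a constant $C > 0$ such that, for all $S \in \mathcal{S}$ and for all $\varepsilon \ge 0$, with probability at least $1 - e^{-t}$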




t 



n 



in 



3 




V 



(5.1) 



and 




V 



(5.2) 




t 



(5.3) 



n 



in 






Note that this result holds for all $\varepsilon \ge 0$, including the case of $\varepsilon = 0$ that corresponds to the least squares estimator over the set $\mathcal{S}$ of all density matrices. The approximation error term $\|\log S\| \varepsilon$ in the bounds of Theorem 4 is of the order $O(\varepsilon)$ (as in Proposition 3) and the random error terms are, up to logarithmic factors, of the order $O(n^{-1/2})$ with respect to the sample size $n$.

The next result provides a more subtle oracle inequality that is akin to approximation error bounds of Proposition 4. In this oracle inequality, the approximation error term due to von Neumann entropy penalization is $a^2(\log S)\varepsilon^2$ (as in Proposition 4), so, it is of the order $O(\varepsilon^2)$. Note that it is assumed implicitly that $a^2(\log S) < +\infty$, i.e., that $S$ is of full rank and the matrix $\log S$ is well defined. The random error terms are of the order $O(n^{-1})$ as $n \to \infty$ (up to logarithmic factors) with an exception of the term $\sigma_{\xi}(\sigma_X \vee \|\mathbb{E}X\|) \|P_{L^{\perp}} S P_{L^{\perp}}\|_1 \sqrt{t_m/n}$, which depends on how well the oracle $S$ is approximated by low rank matrices. If $\|P_{L^{\perp}} S P_{L^{\perp}}\|_1$ is small, say of the order $n^{-1/2}$ for a subspace $L$ of a small dimension $r$, this term becomes comparable to other terms in the bound, or even smaller. The inequalities hold only for the values of regularization parameter $\varepsilon$ above certain threshold (so, this result does not apply to the simple least squares estimator). The first bound shows that if there is an oracle $S \in \mathcal{S}$ such that: (a) it is "well aligned", that is, $a(\log S)$ is small; (b) there exists a subspace $L$ of small dimension $r$ such that the oracle matrix $S$ is "almost supported" in $L$, that is, $\|P_{L^{\perp}} S P_{L^{\perp}}\|_1$ is small; and (c) $S$ provides a good approximation of the density matrix $\rho$, that is, $\|S - \rho\|^2_{L_2(\Pi)}$ is small, then the empirical solution $\hat{\rho}^{\varepsilon}$ will be in the intersection of the $L_2(\Pi)$-ball and the Kullback-Leibler "ball" of small enough radii around the oracle $S$. The second bound is an oracle inequality showing how the $L_2(\Pi)$-error $\|\hat{\rho}^{\varepsilon} - \rho\|^2_{L_2(\Pi)}$ depends on the properties of the oracle $S$.

Theorem 5 There exist numerical constants $C > 0$, $D > 0$ such that the following holds. For all $t > 0$, for all $\lambda > 0$, for all $\varepsilon \ge D\varepsilon_{n,m}$, for all subspaces $L \subset \mathbb{C}^m$ with $\dim(L) := r$, and for all $S \in \mathcal{S}$, with probability at least $1 - e^{-t}$,



\P £ ~ S|lL(n) + \K{p £ ; S)<(l + \)\\S- H|| 2(n) + - 



(logS)e 2 \/ (5.4) 



2/3 2 (L) !^+^ V (axV ||EX||)||P l ± SP L ,\\ ^^^V^ 2 - 
<■ n v V n v n v n 






and 



\p £ -p\\ 2 L 2( n)<(l + mS-p\\l 2(u) + j 
at(a x V\\EX\\)\\P L xSP L ' 



a 2 (log5) e 2 V-I/3 2 W ! ^ i V 



X 1' 



\J Ci u^\Ju 



n 



(5.5) 



Next we give a version of (5.4) in a special case when $S = \rho^{\varepsilon}$. This provides bounds on random errors of estimation of the true penalized solution $\rho^{\varepsilon}$ by its empirical version $\hat{\rho}^{\varepsilon}$ both in the $L_2(\Pi)$ and in the Kullback-Leibler distances. Note that unlike the bounds for an arbitrary oracle $S$, there is no dependence on the alignment coefficient $a(\log \rho^{\varepsilon})$ in this case. The result essentially shows that as soon as the true solution $\rho^{\varepsilon}$ is approximately low rank in the sense that $P_{L^{\perp}} \rho^{\varepsilon} P_{L^{\perp}}$ is "small" for a subspace $L$ of a "small" dimension $r$ and $\rho^{\varepsilon}$ provides a good approximation of the target density matrix $\rho$, the empirical solution $\hat{\rho}^{\varepsilon}$ would also provide a good approximation of $\rho$ and it would be approximately low rank.



Theorem 6 There exist numerical constants $C > 0$, $D > 0$ such that the following holds. For all $t > 0$, for all $\varepsilon \ge D\varepsilon_{n,m}$ and for all subspaces $L \subset \mathbb{C}^m$ with $\dim(L) := r$, with probability at least $1 - e^{-t}$,



C 



p £ -p £ \\l 2(u) +eK(p £ ;p £ )< 
mr + r n 



a £ 2 /3 2 (L)^^V^(^V||EX||)||P x .p^|| lv /^\/ 



n 




U\\p £ - p\\l 2 {U) 



2 v 



Piii-V c ^ 

n v 



T n V t r 



n 



(5.6) 



Remark. In the case when the noise is not necessarily bounded, but $\|\xi\|_{\psi_1} < +\infty$, the results still hold with the following simple modifications. In bounds (5.1), (5.2), (5.3) and in the definition of $\varepsilon_{n,m}$, the term $(c_{\xi} U \vee U^2) \frac{t_m}{n}$ is to be replaced by



(T 5 fix. 

In the bounds of theorems [5] and El the term c^U Tn '^ i tm is to be replaced by 



J7I 

n 



We will provide a detailed proof of Theorem 5. The proof of Theorem 4 is its simplified version. The proof of Theorem 6 relies on the bounds derived in the proof of Theorem 5. It is also possible to derive the oracle inequalities of Theorem 5 from Theorem 6 and from the approximation error bounds of Proposition 4. Throughout the proofs below, $C, C_1, \dots$ are numerical constants whose values might be different in different places.

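Proof of Theorem 5. Denote

$$L_n(S) := n^{-1} \sum_{j=1}^{n} (Y_j - \mathrm{tr}(S X_j))^2 + \varepsilon\, \mathrm{tr}(S \log S).$$

For any $S \in \mathcal{S}$ of full rank and any direction $\nu \in M_m(\mathbb{C})$, we have

$$DL_n(S; \nu) = 2n^{-1} \sum_{j=1}^{n} (\langle S, X_j \rangle - Y_j) \langle \nu, X_j \rangle + \varepsilon\, \mathrm{tr}(\nu \log S).$$

By necessary conditions of extrema in the convex optimization problem (1.2), $DL_n(\hat{\rho}^{\varepsilon}; \hat{\rho}^{\varepsilon} - S) \le 0$, which implies

$$DL(\hat{\rho}^{\varepsilon}; \hat{\rho}^{\varepsilon} - S) - DL(S; \hat{\rho}^{\varepsilon} - S) \le -DL(S; \hat{\rho}^{\varepsilon} - S) + DL(\hat{\rho}^{\varepsilon}; \hat{\rho}^{\varepsilon} - S) - DL_n(\hat{\rho}^{\varepsilon}; \hat{\rho}^{\varepsilon} - S). \tag{5.7}$$

Note that

$$DL(\hat{\rho}^{\varepsilon}; \hat{\rho}^{\varepsilon} - S) - DL(S; \hat{\rho}^{\varepsilon} - S) = 2\|\hat{\rho}^{\varepsilon} - S\|^2_{L_2(\Pi)} + \varepsilon K(\hat{\rho}^{\varepsilon}; S)$$

(see the proof of Proposition 3) and

$$DL(S; \hat{\rho}^{\varepsilon} - S) = 2\langle S - \rho, \hat{\rho}^{\varepsilon} - S \rangle_{L_2(\Pi)} + \varepsilon\, \mathrm{tr}((\hat{\rho}^{\varepsilon} - S) \log S).$$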
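By a simple algebra similar to what has been already used in the proofs of propositions 3, 4 we get the following bound:

$$2\|\hat{\rho}^{\varepsilon} - S\|^2_{L_2(\Pi)} + 2\langle S - \rho, \hat{\rho}^{\varepsilon} - S \rangle_{L_2(\Pi)} + \varepsilon K(\hat{\rho}^{\varepsilon}; S) = \|\hat{\rho}^{\varepsilon} - S\|^2_{L_2(\Pi)} + \|\hat{\rho}^{\varepsilon} - \rho\|^2_{L_2(\Pi)} - \|S - \rho\|^2_{L_2(\Pi)} + \varepsilon K(\hat{\rho}^{\varepsilon}; S)$$

$$\le -\varepsilon\, \mathrm{tr}((\hat{\rho}^{\varepsilon} - S) \log S) - \frac{2}{n} \sum_{j=1}^{n} \bigl( \langle \hat{\rho}^{\varepsilon} - S, X_j \rangle^2 - \mathbb{E}\langle \hat{\rho}^{\varepsilon} - S, X \rangle^2 \bigr) - \frac{2}{n} \sum_{j=1}^{n} \bigl( \langle S - \rho, X_j \rangle \langle \hat{\rho}^{\varepsilon} - S, X_j \rangle - \mathbb{E}\langle S - \rho, X \rangle \langle \hat{\rho}^{\varepsilon} - S, X \rangle \bigr) + \frac{2}{n} \sum_{j=1}^{n} \xi_j \langle \hat{\rho}^{\varepsilon} - S, X_j \rangle. \tag{5.8}$$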
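Since $\varepsilon |\mathrm{tr}((\hat{\rho}^{\varepsilon} - S) \log S)| \le \varepsilon\, a(\log S) \|\hat{\rho}^{\varepsilon} - S\|_{L_2(\Pi)}$, we get from (5.8) that

$$\|\hat{\rho}^{\varepsilon} - S\|^2_{L_2(\Pi)} + \|\hat{\rho}^{\varepsilon} - \rho\|^2_{L_2(\Pi)} + \varepsilon K(\hat{\rho}^{\varepsilon}; S) \le \|S - \rho\|^2_{L_2(\Pi)} + \varepsilon\, a(\log S) \|\hat{\rho}^{\varepsilon} - S\|_{L_2(\Pi)} - \frac{2}{n} \sum_{j=1}^{n} \bigl( \langle \hat{\rho}^{\varepsilon} - S, X_j \rangle^2 - \mathbb{E}\langle \hat{\rho}^{\varepsilon} - S, X \rangle^2 \bigr)$$

$$- \frac{2}{n} \sum_{j=1}^{n} \bigl( \langle S - \rho, X_j \rangle \langle \hat{\rho}^{\varepsilon} - S, X_j \rangle - \mathbb{E}\langle S - \rho, X \rangle \langle \hat{\rho}^{\varepsilon} - S, X \rangle \bigr) + \frac{2}{n} \sum_{j=1}^{n} \xi_j \langle \hat{\rho}^{\varepsilon} - S, X_j \rangle. \tag{5.9}$$

We need to bound the empirical processes in the right hand side of bound (5.9). We will do it in several steps by bounding each term separately.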
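Step 1. To bound the first term note that

$$\left| \frac{1}{n} \sum_{j=1}^{n} \bigl( \langle \hat{\rho}^{\varepsilon} - S, X_j \rangle^2 - \mathbb{E}\langle \hat{\rho}^{\varepsilon} - S, X \rangle^2 \bigr) \right| = \left| \left\langle (\hat{\rho}^{\varepsilon} - S) \otimes (\hat{\rho}^{\varepsilon} - S),\ \frac{1}{n} \sum_{j=1}^{n} \bigl( (X_j \otimes X_j) - \mathbb{E}(X \otimes X) \bigr) \right\rangle \right|.$$

Therefore,

$$\left| \frac{1}{n} \sum_{j=1}^{n} \bigl( \langle \hat{\rho}^{\varepsilon} - S, X_j \rangle^2 - \mathbb{E}\langle \hat{\rho}^{\varepsilon} - S, X \rangle^2 \bigr) \right| \le \|\hat{\rho}^{\varepsilon} - S\|_1^2 \left\| \frac{1}{n} \sum_{j=1}^{n} \bigl( (X_j \otimes X_j) - \mathbb{E}(X \otimes X) \bigr) \right\|.$$

Note that $\|X \otimes X\| = \|X\|^2 \le U^2$ and also $\|X \otimes X - \mathbb{E}(X \otimes X)\| \le 2U^2$. Using noncommutative Bernstein's inequality (see (3.2) in subsection 3.3) we can claim that with probability at least $1 - e^{-t}$

$$\left\| \frac{1}{n} \sum_{j=1}^{n} \bigl( (X_j \otimes X_j) - \mathbb{E}(X \otimes X) \bigr) \right\| \le 4 \left( \sigma_{X \otimes X} \sqrt{\frac{t + \log(2m^2)}{n}} \vee U^2\, \frac{t + \log(2m^2)}{n} \right)$$

and, with the same probability,

$$\left| \frac{1}{n} \sum_{j=1}^{n} \bigl( \langle \hat{\rho}^{\varepsilon} - S, X_j \rangle^2 - \mathbb{E}\langle \hat{\rho}^{\varepsilon} - S, X \rangle^2 \bigr) \right| \le 4 \left( \sigma_{X \otimes X} \sqrt{\frac{t + \log(2m^2)}{n}} \vee U^2\, \frac{t + \log(2m^2)}{n} \right) \|\hat{\rho}^{\varepsilon} - S\|_1^2.$$

Step 2. The second term can be written as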



lj2((S-p, X 3 )(? - S, X 3 ) - E(S - p, X) <p £ -S,X)) 
n .1=1 V J 
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U ,=1 V / 



and bounded as follows 



~ E (< 5 - * X iW - S, X,) - E(S - p, X) <p e -S,X^j 
\ E ( ( S ~ Pi X i) X i ~ E < S " P> X ^ X ) 



< 



\\p £ -sh 



3=1 



We use again the noncommutative version of Bernstein's inequality to show that with 
probability at least 1 — e _i 



\ E (V - p> x ^ - E ^ - p> x ^ x ) 



< 



wis - Ph, m {-±^1 \/ 4uHs _ 4l ±±MM, 



n 

where we also used simple bounds ||E(5 — p, X) 2 X 2 \\ < U 2 \\S — p||| 2 m) an< ^ \\(S 
p, X)X\\ < U 2 \\S - p||i. Since ||,5 £ - S\\i < 2, we get 



1 n ( 

- E (5 - p, X,) (p £ - 5, X, ) - E(S - p, X) (p £ - S, X) 

H 3=1 V 



< 



Step 3. We turn now to bounding the third term in the right hand side of ()5.9|) . It 
is easy to decompose it as follows: 



l -iZ^p £ - s,x s ) = (P L ,(p £ - S)P L ,,^J2^X j P L s 

3=1 3=1 

1 n 

-Y.^-S.VlX,). (5.10) 



3=1 



Note that 

1 n 

P L ^ - S)P L x,-^iP^X j P L ^ 



3=1 



<\\P L ^p e -s)P L A\i 



3=1 
Applying the noncommutative version of Bernstein's inequality one more time, we have that with probability at least $1 - e^{-t}$

n 

rt 



1 n 

-Y,^{P^X j P L ^ - EP L xXP L , 



<2^V t + 1 °f (2m) \/2c^- t + l0g(2m) 



■;?. 



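where we used a simple bound $\|\mathbb{E}(P_{L^{\perp}}(X - \mathbb{E}X)P_{L^{\perp}})^2\| \le \|\mathbb{E}(X - \mathbb{E}X)^2\| = \sigma_X^2$. Also, it follows from the classical Bernstein's inequality and the bound $\|\mathbb{E}(P_{L^{\perp}} X P_{L^{\perp}})\| \le \|\mathbb{E}X\|$ that with probability at least $1 - e^{-t}$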



-^EP^XP^ = & \eP l .XP l , 

i=i j=i 

Hence, with probability at least 1 — 2e _< , 



< 2a^\\EX\\J -\J 2c^\\EX" ' 



n 



1 n 

P L x {p £ - S)P L x , - ^ P ^XjP L 



< 



21^(^-5)^11! 



a,(a x + \\EX\\^ t + l0 f^ \j2c,U 



t + log(2m) 



n y n 
To bound the second term in the right hand side of (15. lOf) . denote 

n 

a n {5) := sup 

Pl,P2&S ,\\pl- P2\\l 2 (ji)<S 



3=1 



Clearly, 



< a n (\\p £ — S'||i 2 (n))- To control a n (5), we use Ta- 
lagrand's concentration inequality for empirical processes. It implies that, for all 8 > 0, 
with probability at least 1 — e~ s , 



On(5) < 2 



Ea n {8) + <t 6 P{L)6J - + Ac^U- 
V n n 



(5.11) 



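Here we used the facts that $\mathbb{E}\xi^2 \langle \rho_1 - \rho_2, \mathcal{P}_L X \rangle^2 \le \sigma_{\xi}^2 \beta^2(L) \|\rho_1 - \rho_2\|^2_{L_2(\Pi)}$ and

$$|\xi \langle \rho_1 - \rho_2, \mathcal{P}_L X \rangle| \le c_{\xi} \|\rho_1 - \rho_2\|_1 \|\mathcal{P}_L X\| \le 2c_{\xi} \bigl( \|X\| + \|P_{L^{\perp}} X P_{L^{\perp}}\| \bigr) \le 4c_{\xi} \|X\| \le 4c_{\xi} U.$$

We will make the bound on $\alpha_n(\delta)$ uniform in $\delta \in [Un^{-1}, 2U]$. To this end, we apply bound (5.11) for $\delta = \delta_j = 2^{-j+1} U$, $j = 0, 1, \dots$ and with $s = \tau_n := t + \log\log_2(2n)$. The union bound and the monotonicity of $\alpha_n(\delta)$ with respect to $\delta$ imply that with probability at least $1 - e^{-t}$ for all $\delta \in [Un^{-1}, 2U]$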


Oi n {8) < C 



Ea n (8) + atP(L)8J — + ci-U^ 
" n n 



(5.12) 
where $C > 0$ is a numerical constant. Now it remains to bound the expected value $\mathbb{E}\alpha_n(\delta)$. Let $e_1, \dots, e_m$ be the orthonormal basis of $\mathbb{C}^m$ such that $L = \mathrm{l.s.}\{e_1, \dots, e_r\}$. Denote $E_{ij}(x)$ the entries of the linear transformation $x \in M_m(\mathbb{C})$ in this basis. Clearly, the function $\langle \rho_1 - \rho_2, \mathcal{P}_L x \rangle$ belongs to the space $\mathcal{L} := \mathrm{l.s.}\{E_{ij} : i \le r \text{ or } j \le r\}$ of dimension $m^2 - (m - r)^2 = 2mr - r^2$. Therefore,



Ea n (5) < E sup 

feC,\\f\\ L2(n) <p(L)6 



Using standard bounds for empirical processes indexed by finite dimensional function 
classes, we get Ka n (S) < 2^/2a^f3{L)5 \f^- ' ■ We can conclude that the following bound 
on a n {5) holds with probability at least 1 — e~* for all 5 6 [Un -1 , 2U] : 



a n (S) < C 



v 



n 



(5.13) 



Note that since ||p e -5||i < 2 and < U, we have \\p E ~ S\\ 2 L , n) = E(p £ -S,X) < 4U 2 , 
so, \\p £ — 5 , ||^ 2 (n) < 2C7. As a result, with probability at least 1 — e - *, we either have 



\P £ ~ S\\l 2 (u) < Un \ or 



< C 



-v^)i:/' : -^IU 2 (n) A /^+^/3(L)||^-5|| L2(n) J^+c^^ 



In the first case, we still have 



1 - 



S,V L Xj 



< c 



U hnT U FFn r n 

a tP\L) — \ +aeP(L) — J—+C£U— 

n V n n V n n 



Let us assume in what follows that $\|\hat{\rho}^{\varepsilon} - S\|_{L_2(\Pi)} \ge Un^{-1}$ since another case is even easier to handle.

We now substitute the bounds of steps 1-3 in the right hand side of (5.9) to get the following inequality that holds with some constant $C > 0$ and with probability at least $1 - 5e^{-t}$:

|2 i use „||2 



W- S\\l m + W-p\\l 2[n) +eK(p e ; S) < 
\S ~ P\\l 2i n) + ea(\ogS)\\p £ - S\\ Lm + 

t-2 



(5.14) 



ie(*x»x^y u 2 \\ P £ -s\\l + iw\\s - P \\ Lm ^ V/ wu 

4\\PlAp £ ~ S)P L ±h 

mr + r n 



2 

n 



+ 



a,ia x V x j i ^\l2c < U- 



+ 



c 



a^{LW-S\\ Lm 



n 



\/^U 



n 
Under the assumption $\varepsilon \ge D\varepsilon_{n,m}$ with a sufficiently large constant $D > 0$, it is easy to get that



S||?< 3l|p e -S||;<-A'(p< ; S). (5.15) 



Also, by Proposition [U 

IIP^-^P^H! < WP^fP^h + WP^SP^h <3||P L x5P L x|| 1 + 2K(p £ ;5), 
and, under the same assumption that e > De n ^ m with a sufficiently large constant D > 0, 



411^(^-5)^111 
C||Pz-l5P x x||i 



a i (a x + \\EX\\)J t -^\/2c^ 
V n v n 



< 



(5.16) 



a c (^V||EX||)J^\/ c ^l 



+ ^(p E ;S). 



Combining bounds (|5.15p and (|5.16p with ()5.14j) yields 

W - s\\l m + W - p\\l m + \k(p £ ; s) < 

\\S-p\\l m +^(logSW-S\\ L2{u) + 

c 



(5.17) 



S\\l 2 {iV)VzP{L) 



rar + r n 



\Ju\\s- P \\ L2{n) J^\J 



n 

2 v m 



\\Pl±SP l , V \\EX\\)\ - V ctU^^ V 

V re v n v n 

with some constant C > 0. It follows from the last inequality that 

W - S\\l 2{n) < A\\p £ ~ S\\ Lm + B- £ -K(p s ; S), 

where A := fa(log5) + Ca^(L)J^±^ and 



(5.18) 



H5-Plli 2( n)-Ilp £ -Pll 



i 2 (n) 



+ 



C 



115 - p\\Lm U f-^M W P L-SP L ± ha^ax V IIEXII)*/^ V c ^ 



r? 



2 

n 



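It is easy to check that

$$\|\hat\rho^\varepsilon - S\|_{L_2(\Pi)} \le \frac{A + \sqrt{A^2 + 4\bigl(B - (\varepsilon/4)K(\hat\rho^\varepsilon; S)\bigr)}}{2} \le A + \sqrt{B - \frac{\varepsilon}{4}K(\hat\rho^\varepsilon; S)}.$$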
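If $\frac{\varepsilon}{4}K(\hat\rho^\varepsilon; S) \ge B$, then $\|\hat\rho^\varepsilon - S\|^2_{L_2(\Pi)} \le A^2$, which, in view of (5.18), implies

$$\|\hat\rho^\varepsilon - S\|^2_{L_2(\Pi)} + \frac{\varepsilon}{4}K(\hat\rho^\varepsilon; S) \le A^2 + B.$$

Otherwise, we have $\|\hat\rho^\varepsilon - S\|^2_{L_2(\Pi)} \le A^2 + 2A\sqrt{B} + B - \frac{\varepsilon}{4}K(\hat\rho^\varepsilon; S)$, which, for all $\lambda > 0$,
implies

$$\|\hat\rho^\varepsilon - S\|^2_{L_2(\Pi)} + \frac{\varepsilon}{4}K(\hat\rho^\varepsilon; S) \le \Bigl(\frac{2}{\lambda} + 1\Bigr)A^2 + \Bigl(1 + \frac{\lambda}{2}\Bigr)B.$$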
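The two elementary steps used in this case analysis can be spelled out as follows. This is only a sketch: $x$ and $b$ are shorthand introduced here (not in the original) for $\|\hat\rho^\varepsilon - S\|_{L_2(\Pi)}$ and $B - \frac{\varepsilon}{4}K(\hat\rho^\varepsilon; S)$, and $b \ge 0$ is assumed, which is the second case above.

```latex
% Assumptions for this sketch: x >= 0, A >= 0, b >= 0 and x^2 <= A x + b.
% Step 1: solve the quadratic inequality and use sqrt(u + v) <= sqrt(u) + sqrt(v).
x \le \frac{A + \sqrt{A^2 + 4b}}{2} \le \frac{A + A + 2\sqrt{b}}{2} = A + \sqrt{b},
\qquad\text{hence}\qquad
x^2 \le A^2 + 2A\sqrt{b} + b \le A^2 + 2A\sqrt{B} + b .

% Step 2: split the cross term with 2uv <= u^2 + v^2,
% taking u = A\sqrt{2/\lambda} and v = \sqrt{\lambda B/2}, for any \lambda > 0.
2A\sqrt{B} \le \frac{2}{\lambda}A^2 + \frac{\lambda}{2}B .
```

Adding $\frac{\varepsilon}{4}K(\hat\rho^\varepsilon; S)$ to both sides of the bound on $x^2$ and inserting Step 2 gives the displayed inequality with the factors $\frac{2}{\lambda} + 1$ and $1 + \frac{\lambda}{2}$.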

In both cases, by the definitions of A and B and by elementary algebra, one can easily 
get the bound 



\P £ ~ S\\l 2{u) + \\P £ ~ P\\l 2{u) + - 4 K(p £ ;S) < 



C 



(l + A)||S-p||£ 2(n) , , 
a^ax\f\\EX\\)\\P L ±SP L x\\i 



a 2 ! (log S)e 2 \J ap 2 (L) 



2n2/T\' mr + Tn 



n 



V 



V c t u —^—\/ u 



11 



(5.19) 



that holds with probability at least $1 - 5e^{-t}$ and with a sufficiently large constant $C$. To
replace the probability $1 - 5e^{-t}$ by $1 - e^{-t}$, it is enough to replace $t$ by $t + \log 5$ and to
adjust the values of constants $C$, $D$ accordingly.
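The substitution $t \mapsto t + \log 5$ can be checked in two lines. The second line is a sketch that assumes the quantity $t_m = t + \log(2m)$ (the definition recalled in Section 6) is the one through which $t$ enters the bound; it shows why the change only costs a constant factor.

```latex
% The failure probability after the substitution:
5e^{-(t + \log 5)} = 5 \cdot \tfrac{1}{5}\, e^{-t} = e^{-t}.

% Since \log(2m) \ge \log 2 > 0 for every m \ge 1 and t > 0:
(t + \log 5) + \log(2m) \le \Bigl(1 + \frac{\log 5}{\log 2}\Bigr)\bigl(t + \log(2m)\bigr).
```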



Proof of Theorem 4. We get back to bound (5.8) in the proof of Theorem 5. This
time, we bound the term $\operatorname{tr}((\hat\rho^\varepsilon - S)\log S)$ in (5.8) in a slightly different way:

$$|\operatorname{tr}((\hat\rho^\varepsilon - S)\log S)| \le \|\log S\|\,\|\hat\rho^\varepsilon - S\|_1 \le 2\|\log S\|,$$

which leads to the following bound (instead of bound (5.9)):

\\P £ ~ S\\l 2i u) + \\P £ ~ P\\l 2{ u) + eK(p s ; S) < \\S - P \\\ m + e\\ log S)\\ + (5.20) 

-J2[(S-P, XjW - S, Xj) - E(S - p, X) (?-S,X))--J2 (P £ - S, X 3 ). 

U 3=1 V J U 3=1 

To bound the empirical processes in the right hand side, we again use the bounds of
steps 1-3 in the proof of Theorem 5. The bound of Step 1 yields



-iti^-S'XJ 2 -^- 15 '*) 2 ) 

71 3=1 ^ ' 



<ni^l t + losi2m2) \/u^ +loei2m2) 



n 



n 
and it follows from the bound of Step 2 that



n .7=1 V J 

t + log(2m) v , 2 t + log(2m) 



< 



8U\\S - p||L 2 (n) 



n Y n 

Instead of the more complicated derivation of Step 3, we now use the noncommutative and
classical Bernstein inequalities to get that, with probability at least $1 - 2e^{-t}$,



n 



<\\P £ -S\\i 



n 



< 2 



n 



+ 



2111X1 



n 



-'£«> 

3=1 



<4(a^ + ||EX||)V t + 1 °f (2m) V 12 ^ 



t + log(2m) 



?i 



Using these inequalities, we derive from (5.20) that with some numerical constant $C > 0$
and with probability at least $1 - 4e^{-t}$,



\P £ 



+C 



S\\ 



L 2 (n) 



+ \\P £ ~ P\\l 2 m) + ^(/5 £ ; S) < \\S - p\\l m + e\\ log 5|| + (5.21 



u\\s - Phm\— + v v ll EX ll)V — \A C ^ v C/2 )— 

w V n \ n v n 



which implies the result in the case when ||log5|| < logT. To finish the proof, it is 
enough, given an arbitrary S G S (even such that log S does not exist), to apply bound 
(15311 to S s = (1-S)S+S^, where 5 £ (0, 1). Clearly, || log S s \\ < log f and we also have 
|| 5 - S S \\l 2{n) < 4<5 2 E||X|| 2 (see the proof of Proposition ED . Taking <5 := ^JWxf A 1 ' [t 
is easy to complete the proof in the case when || log S\\ > logT. 



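Proof of Theorem 6. Note that, similarly to $\hat\rho^\varepsilon$, $\rho^\varepsilon$ is also a matrix of full rank
and $\log\rho^\varepsilon$ is well defined. By necessary conditions of extrema in convex problems (1.2)
and (4.1), we have $DL_n(\hat\rho^\varepsilon; \hat\rho^\varepsilon - \rho^\varepsilon) \le 0$ and $DL(\rho^\varepsilon; \hat\rho^\varepsilon - \rho^\varepsilon) \ge 0$. Subtracting the second
inequality from the first one yields

$$DL(\hat\rho^\varepsilon; \hat\rho^\varepsilon - \rho^\varepsilon) - DL(\rho^\varepsilon; \hat\rho^\varepsilon - \rho^\varepsilon) \le DL(\hat\rho^\varepsilon; \hat\rho^\varepsilon - \rho^\varepsilon) - DL_n(\hat\rho^\varepsilon; \hat\rho^\varepsilon - \rho^\varepsilon). \tag{5.22}$$
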

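By a simple algebra already used in the proof of Theorem 5, this easily leads to the
following bound:

$$2\|\hat\rho^\varepsilon - \rho^\varepsilon\|^2_{L_2(\Pi)} + \varepsilon K(\hat\rho^\varepsilon; \rho^\varepsilon) \le 2\mathbb{E}\langle\hat\rho^\varepsilon - \rho, X\rangle\langle\hat\rho^\varepsilon - \rho^\varepsilon, X\rangle - \frac{2}{n}\sum_{j=1}^n\bigl(\langle\hat\rho^\varepsilon, X_j\rangle - Y_j\bigr)\langle\hat\rho^\varepsilon - \rho^\varepsilon, X_j\rangle,$$

which can be further rewritten as

$$2\|\hat\rho^\varepsilon - \rho^\varepsilon\|^2_{L_2(\Pi)} + \varepsilon K(\hat\rho^\varepsilon; \rho^\varepsilon) \le -\frac{2}{n}\sum_{j=1}^n\Bigl(\langle\hat\rho^\varepsilon - \rho^\varepsilon, X_j\rangle^2 - \mathbb{E}\langle\hat\rho^\varepsilon - \rho^\varepsilon, X\rangle^2\Bigr) \tag{5.23}$$

$$\qquad - \frac{2}{n}\sum_{j=1}^n\Bigl(\langle\rho^\varepsilon - \rho, X_j\rangle\langle\hat\rho^\varepsilon - \rho^\varepsilon, X_j\rangle - \mathbb{E}\langle\rho^\varepsilon - \rho, X\rangle\langle\hat\rho^\varepsilon - \rho^\varepsilon, X\rangle\Bigr) + \frac{2}{n}\sum_{j=1}^n\xi_j\langle\hat\rho^\varepsilon - \rho^\varepsilon, X_j\rangle.$$

We use the bounds of steps 1-3 of the proof of Theorem 5 with $S = \rho^\varepsilon$ to control
each term in the right hand side of (5.23). Substituting these bounds in (5.23), we get
the following inequality that holds with probability at least $1 - 5e^{-t}$: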



2W-p £ \\l 2{n) +eK(?- p z)< 

t + log(2m 2 ) v / 2 t + log(2m 



(5.24) 



8 ax®x 



n 



n 



16C/||p e - p\\l 2 {u) 



t + log(2m) 



\/ IQU 



\P £ -P £ \\l + 



2„ e I, * + log(2m) 
P "Pill r 



77. 



4|| j Plx( / 5 £ -p £ )P l x||i 



a^a x + ||EX| 



rj + log(2m) 



\j2etU 




t + log (2m) 
n 



+ 



Arguing exactly as in the proof of Theorem 5, we can simplify (5.24) to get

£ 

4' 



2||p £ -P £ lli 2( n) + 7^(p £ ;p £ )< 



(5.25) 



1RIT ii e || / t + log(2m) \ / 1<8rT g„ e H t + log(2m) 
16tf||p e - p||i 2( n)\/ V 16U WP ~ Plli + 

12\\P L xffP L x\\i 



a i {a x + ||EX| 



rj + log(2m) 



\/2c € C/ 



f + log(2m) 



+ 



^(£)||p e - ^|| L2(n) + a e /3(L)||p £ - pIlW^ + ^ 



It is now easy to solve this for $\|\hat\rho^\varepsilon - \rho^\varepsilon\|_{L_2(\Pi)}$ and to derive the following explicit bound on
the random error that holds with probability at least $1 - 5e^{-t}$ and with some numerical
constant $C > 0$:

mr + T n v / „T« 



/5 £ -p £ Hi 2( n)+^(/5 £ ;P £ )<C 



^W) : 



^l|p £ -p||L 2 (n) 



rj + log(2m) 



71 



, .. t + log(2m 

"P -Pill 

n 



V 



^p^Hi (r € (o* V||EX| 



t + log (2m 



77, 
which easily implies the result.



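Example 1. Matrix completion (continuation). Recall that, in this example,
$\{e_i : i = 1, \dots, m\}$ is the canonical basis of $\mathbb{C}^m$ and the following set of Hermitian
matrices forms an orthonormal basis of $\mathbb{M}_m(\mathbb{C})$ (the matrix completion basis):

$$\bigl\{e_i \otimes e_i : i = 1, \dots, m\bigr\} \cup \Bigl\{\tfrac{1}{\sqrt{2}}(e_i \otimes e_j + e_j \otimes e_i) : 1 \le i < j \le m\Bigr\} \cup \Bigl\{\tfrac{\mathrm{i}}{\sqrt{2}}(e_i \otimes e_j - e_j \otimes e_i) : 1 \le i < j \le m\Bigr\},$$

where $\mathrm{i}$ in the last family is the imaginary unit. Assume that $X$ is sampled at random from this basis. Recall that in this case, for all
matrices $A$, $\|A\|^2_{L_2(\Pi)} = m^{-2}\|A\|_2^2$. Obviously, $\|e_i \otimes e_i\| = 1$, $i = 1, \dots, m$, and, for all
$i < j$,

$$\Bigl\|\tfrac{1}{\sqrt{2}}(e_i \otimes e_j + e_j \otimes e_i)\Bigr\| = \frac{1}{\sqrt{2}}, \qquad \Bigl\|\tfrac{\mathrm{i}}{\sqrt{2}}(e_i \otimes e_j - e_j \otimes e_i)\Bigr\| = \frac{1}{\sqrt{2}}.$$

Therefore, $\|X\| \le U = 1$. We also have

$$\sigma_X^2 \le \|\mathbb{E}X^2\| = \sup_{v \in \mathbb{C}^m, |v| = 1}\mathbb{E}\langle X^2 v, v\rangle = \sup_{v \in \mathbb{C}^m, |v| = 1}\mathbb{E}\langle Xv, Xv\rangle = \sup_{v \in \mathbb{C}^m, |v| = 1}\mathbb{E}|Xv|^2.$$

Note that, if $X = e_i \otimes e_i$, $i = 1, \dots, m$, then $|Xv|^2 = |e_i\langle e_i, v\rangle|^2 = |\langle e_i, v\rangle|^2$. If $X = \tfrac{1}{\sqrt{2}}(e_i \otimes e_j + e_j \otimes e_i)$, $i < j$, then

$$|Xv|^2 = \frac{1}{2}\bigl|e_i\langle e_j, v\rangle + e_j\langle e_i, v\rangle\bigr|^2 = \frac{1}{2}\bigl(|\langle e_j, v\rangle|^2 + |\langle e_i, v\rangle|^2\bigr)$$

and, similarly, if $X = \tfrac{\mathrm{i}}{\sqrt{2}}(e_i \otimes e_j - e_j \otimes e_i)$, $i < j$, then also $|Xv|^2 = \frac{1}{2}\bigl(|\langle e_j, v\rangle|^2 + |\langle e_i, v\rangle|^2\bigr)$.
Therefore, for $|v| = 1$,

$$\mathbb{E}|Xv|^2 = m^{-2}\sum_{i=1}^m|\langle e_i, v\rangle|^2 + 2m^{-2}\,\frac{1}{2}\sum_{i<j}\bigl(|\langle e_j, v\rangle|^2 + |\langle e_i, v\rangle|^2\bigr) \le m^{-2}|v|^2 + m^{-2}m\bigl(|v|^2 + |v|^2\bigr) \le 3m^{-1},$$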

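which implies that $\sigma_X \le 3^{1/2}m^{-1/2}$. By a similar simple computation, $\sigma_{X \otimes X} \le 4m^{-1/2}$. Now we
can derive the following corollary of Theorem 5. Let

$$\varepsilon_{n,m} := \bigl(\sigma_\xi m^{-1/2} \vee m^{-1/2}\bigr)\sqrt{\frac{t_m}{n}} \vee (c_\xi \vee 1)\frac{t_m}{n}$$

and let $\varepsilon = D\varepsilon_{n,m}$ for a sufficiently large constant $D > 0$.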
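The computations in this example can be checked numerically for small $m$. The following is a minimal sketch in plain Python (no external libraries); the helper names `unit`, `add`, `matmul`, `hs_inner`, `matrix_completion_basis` and `second_moment` are introduced here for illustration only. It builds the matrix completion basis, and lets one verify that it is Hermitian and orthonormal for the Hilbert-Schmidt inner product and that $\mathbb{E}X^2$ equals $m^{-1}I_m$ under uniform sampling, which is consistent with the bound $\mathbb{E}|Xv|^2 \le 3m^{-1}$.

```python
import math


def unit(m, i, j):
    """Matrix unit e_i (x) e_j: 1 in position (i, j), 0 elsewhere."""
    return [[1.0 + 0j if (r, c) == (i, j) else 0j for c in range(m)]
            for r in range(m)]


def add(a, b, sa=1.0, sb=1.0):
    """Entrywise linear combination sa * a + sb * b."""
    return [[sa * x + sb * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def matmul(a, b):
    """Product of two square matrices stored as nested lists."""
    m = len(a)
    return [[sum(a[r][k] * b[k][c] for k in range(m)) for c in range(m)]
            for r in range(m)]


def hs_inner(a, b):
    """Hilbert-Schmidt inner product <A, B> = tr(A B*)."""
    return sum(x * y.conjugate() for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def matrix_completion_basis(m):
    """The m^2 Hermitian matrices of the matrix completion basis."""
    s = 1.0 / math.sqrt(2.0)
    basis = [unit(m, i, i) for i in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            eij, eji = unit(m, i, j), unit(m, j, i)
            basis.append(add(eij, eji, s, s))             # symmetric part
            basis.append(add(eij, eji, 1j * s, -1j * s))  # antisymmetric part times i
    return basis


def second_moment(basis):
    """E X^2 when X is sampled uniformly from the basis."""
    m = len(basis[0])
    total = [[0j] * m for _ in range(m)]
    for x in basis:
        total = add(total, matmul(x, x))
    return [[v / len(basis) for v in row] for row in total]
```

For example, `second_moment(matrix_completion_basis(4))` is the diagonal matrix with entries `0.25`, so in this case $\sup_{|v|=1}\mathbb{E}|Xv|^2 = m^{-1}$.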
Corollary 1 There exists a numerical constant $C > 0$ such that the following holds. For
all $t > 0$, for all $\lambda > 0$, for all sufficiently large $D$ and for $\varepsilon = D\varepsilon_{n,m}$, for all matrices
$S \in \mathcal{S}$ of rank $r$, with probability at least $1 - e^{-t}$,



C 



/>|li 2( n)<(l + A)||5-p||i 2(n) + - 



( c |vi)^k 



log 2 (mn)\/a|^ \y c ? 



71 n 



(5.27) 



Proof. First observe that for all matrices $S \in \mathcal{S}$ of full rank (for which $\log S$ exists)
and for all subspaces $L \subset \mathbb{C}^m$ with $\dim(L) = r$, we have, with probability at least $1 - e^{-t}$
and with an arbitrary $\lambda > 0$,

t-2 



2(1 

|p £ -p||! 2( n)<(l + A/2)||5-p||| 2(n) + T - 



a 2 (log5)f(a|vl)^ + (4vl)M \/ 
\ 5 ran 5 n / v 



r? 



x i 




(5.28) 



This immediately follows from Theorem 5 since, in the case under consideration, $\beta(L) = 1$,
$\sigma_X \le 3^{1/2}m^{-1/2}$, $\sigma_{X \otimes X} \le 4m^{-1/2}$, $U = 1$. Note also that in this case $\Lambda(L) = m$ (recall
the definition of $\Lambda(L)$ given before Proposition 5) and

$$a(\log S) \le m\inf_c\|\log S + cI_m\|_2.$$

Suppose now that $S \in \mathcal{S}$ is an arbitrary oracle of rank $r$. Then there exists a subspace $L$
of dimension $r$ such that $P_{L^\perp}SP_{L^\perp} = 0$. We will use bound (5.28) for $S_\delta := (1 - \delta)S + \delta\frac{I_m}{m}$,
where $\delta = \varepsilon \wedge 1$, as we did in the proof of Proposition 5. As in this proof, we have, for
some constant $C_1 > 0$,

1 + — J < C\m\J~r log(mn) 
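and

$$\|S - S_\delta\|^2_{L_2(\Pi)} \le 4\delta^2\mathbb{E}\|X\|^2 \le 4\delta^2 \le 4\varepsilon^2.$$

Finally, note that

$$\|P_{L^\perp}S_\delta P_{L^\perp}\|_1 \le (1 - \delta)\|P_{L^\perp}SP_{L^\perp}\|_1 + \delta\|P_{L^\perp}(I_m/m)P_{L^\perp}\|_1 \le \delta \le \varepsilon.$$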
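The effect of mixing an oracle with the maximally mixed state can be seen at the level of eigenvalues. The sketch below is an illustration under stated assumptions, not part of the original argument: since $I_m$ commutes with $S$, the matrix $S_\delta = (1 - \delta)S + \delta I_m/m$ has the same eigenvectors as $S$, so it suffices to work with the spectrum, and $L$ is taken to be the span of the top $r$ eigenvectors of $S$. The function names are hypothetical.

```python
import math


def mixed_spectrum(eigs, delta):
    """Eigenvalues of S_delta = (1 - delta) * S + delta * I_m / m,
    given the eigenvalues of S (same eigenvectors, since I_m commutes with S)."""
    m = len(eigs)
    return [(1.0 - delta) * lam + delta / m for lam in eigs]


def log_operator_norm(eigs):
    """Operator norm of log S for a positive definite S with these eigenvalues."""
    return max(abs(math.log(lam)) for lam in eigs)


def tail_trace(eigs, r):
    """Trace norm of P_{L^perp} S P_{L^perp} when L is spanned by the
    eigenvectors of the r largest eigenvalues."""
    return sum(sorted(eigs, reverse=True)[r:])
```

For a rank-2 oracle with spectrum `[0.7, 0.3, 0, 0, 0]` and `delta = 0.1`, every eigenvalue of $S_\delta$ lies in $[\delta/m, 1]$, so `log_operator_norm` is at most $\log(m/\delta)$, and `tail_trace(..., 2)` equals $\delta(m - r)/m \le \delta$, in line with the last display.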



Substituting these bounds in (5.28) (with $S$ replaced by $S_\delta$) and bounding $\|S_\delta - \rho\|^2_{L_2(\Pi)}$
in terms of $\|S - \rho\|^2_{L_2(\Pi)}$ and $\|S_\delta - S\|^2_{L_2(\Pi)}$ (similarly to what was done in the proof of
Proposition 5), it is easy to derive (5.27) from (5.28). Note that we can drop the term
$\sigma_\xi^2\frac{mr}{n}$ since it is dominated by $(\sigma_\xi^2 \vee 1)\frac{rmt_m}{n}\log^2(mn)$.

□ 

Similarly, it is easy to obtain another corollary where the $L_2(\Pi)$-error of the estimator
$\hat\rho^\varepsilon$ is controlled in terms of Gibbs oracles. Recall the notations at the end of Section 4
and also denote T r := ||i?<r||l = Ylk=i 7fc- 

Corollary 2 There exists a numerical constant $C > 0$ such that the following holds. For
all $t > 0$, for all $\lambda > 0$, for all sufficiently large $D$ and for $\varepsilon = D\varepsilon_{n,m}$, for all Hermitian
matrices $H$ and for all $r \le m$, with probability at least $1 - e^{-t}$,



C 



\p £ - PiiLoi) < (i + a) hp* - P \\l m + x ^ V d2 ( K 2 v V 



5 2 . (H) v / 2 ( i 2 -,\^r m U. 



m* Y \ n 



( C | v i) Trm l t2 A \/ *t mr+Tn v A-^^ v - 



(5.29) 



Example 2. Pauli basis (continuation). We now turn to another example described
in the Introduction, the example of the Pauli basis. Recall that in this case $m = 2^k$
and we are considering the basis of the space $\mathbb{M}_{2^k}(\mathbb{C})$ that consists of all matrices of the
form $W_{i_1} \otimes \cdots \otimes W_{i_k}$, where $W_i = \frac{1}{\sqrt{2}}\sigma_i$, $i = 1, \dots, 4$, are the normalized $2 \times 2$ Pauli matrices.
Note that $\|W_i\|_2 = 1$ and $\|W_i\| = \frac{1}{\sqrt{2}}$. The design variable $X$ is picked at random from
this basis. We still have $\|A\|^2_{L_2(\Pi)} = m^{-2}\|A\|_2^2$. However, now

$$\|W_{i_1} \otimes \cdots \otimes W_{i_k}\| = \|W_{i_1}\| \cdots \|W_{i_k}\| = \Bigl(\frac{1}{\sqrt{2}}\Bigr)^k = 2^{-k/2} = m^{-1/2},$$

implying that $\|X\| = m^{-1/2}$ and $U = m^{-1/2}$. To state a corollary of Theorem 5 in this
case, we take $\varepsilon := D\varepsilon_{n,m}$, where



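$$\varepsilon_{n,m} := \bigl(\sigma_\xi m^{-1/2} \vee m^{-1}\bigr)\sqrt{\frac{t_m}{n}} \vee \bigl(c_\xi m^{-1/2} \vee m^{-1}\bigr)\frac{t_m}{n}.$$

The following results are similar to Corollaries 1 and 2.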
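The norm computation for the Pauli basis can be checked directly for small $k$. The sketch below is in plain Python with hypothetical helper names (`kron`, `matmul`, `hs_inner`, `pauli_basis`), and it assumes the standard Pauli matrices with the identity included as the fourth one. It relies on the fact that a Hermitian $X$ with $X^2 = m^{-1}I_m$ has all eigenvalues equal to $\pm m^{-1/2}$, so that $\|X\| = m^{-1/2}$.

```python
import math
from itertools import product

S = 1.0 / math.sqrt(2.0)
# Normalized 2x2 Pauli matrices W_i = sigma_i / sqrt(2), identity included.
PAULI = [
    [[S, 0], [0, S]],
    [[0, S], [S, 0]],
    [[0, -1j * S], [1j * S, 0]],
    [[S, 0], [0, -S]],
]


def kron(a, b):
    """Kronecker (tensor) product of two square matrices as nested lists."""
    nb = len(b)
    n = len(a) * nb
    return [[a[r // nb][c // nb] * b[r % nb][c % nb] for c in range(n)]
            for r in range(n)]


def matmul(a, b):
    """Product of two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[r][k] * b[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]


def hs_inner(a, b):
    """Hilbert-Schmidt inner product <A, B> = tr(A B*)."""
    return sum(x * y.conjugate() for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def pauli_basis(k):
    """All 4^k tensor products W_{i_1} (x) ... (x) W_{i_k}."""
    basis = []
    for idx in product(range(4), repeat=k):
        x = [[1.0]]
        for i in idx:
            x = kron(x, PAULI[i])
        basis.append(x)
    return basis
```

With `k = 2` (so $m = 4$), each of the 16 basis matrices `x` satisfies `matmul(x, x)` equal to $I_4/4$ up to rounding, which gives $\|X\| = 1/2 = m^{-1/2}$.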

Corollary 3 There exists a numerical constant $C > 0$ such that the following holds.
For all $t > 0$, for all $\lambda > 0$, for all sufficiently large $D > 0$ and for $\varepsilon = D\varepsilon_{n,m}$, for all
matrices $S \in \mathcal{S}$ of rank $r$, with probability at least $1 - e^{-t}$,



\\p e -p\\ 2 L 2i u)<^+ms- P \\i 2iu) + j 

t 2 w -l\ rrn ^rn \ 1 2/ \ \ / 2 T n \ / -1/2 T n V t m \ i t m 

( c € Vm )^2-J lQ g (jnn)\/ o^-y C( .m ' —— \j — 



D 2 [{a 2 ym- l ) r -^\J 



(5.30) 
Corollary 4 There exists a numerical constant $C > 0$ such that the following holds. For
all $t > 0$, for all $\lambda > 0$, for all sufficiently large $D$ and for $\varepsilon = D\varepsilon_{n,m}$, for all Hermitian
matrices $H$ and for all $r \le m$, with probability at least $1 - e^{-t}$,



C 



\P £ - P \\i 2(n) < (1 + X)\\ph - Hl£ 2( n) + J 



m l v V n v 



(c? V m x ) 



2+2 



V 



, mr + r, 



r? 



- V c € m V — 

v n v mn 



(5.31) 



Note that the bounds of corollaries [T][5] can be also proved in the case when the 
noise is unbounded, in particular, Gaussian (see the remark after Theorem [6|) . For the 
Pauli basis, this immediately leads to Theorem [3] stated in the Introduction. 

6 Oracle Inequalities: Subgaussian Design Case 



In this section, we turn to the case of subgaussian design matrices. More precisely, we
assume that $X$ is a Hermitian random matrix with distribution $\Pi$ such that, for some
constant $b_0 > 0$ and for all Hermitian matrices $A \in \mathbb{M}_m(\mathbb{C})$, $\langle A, X\rangle$ is a subgaussian
random variable with parameter $b_0\|A\|_{L_2(\Pi)}$. This implies that $\mathbb{E}X = 0$ and, for some
constant $b_1 > 0$,

$$\bigl\|\langle A, X\rangle\bigr\|_{\psi_2} \le b_1\|A\|_{L_2(\Pi)}, \quad A \in \mathbb{M}_m(\mathbb{C}). \tag{6.1}$$

In addition to this, assume that, for some constant $b_2 > 0$,

$$\|A\|_{L_2(\Pi)} = \bigl\|\langle A, X\rangle\bigr\|_{L_2(\Pi)} \le b_2\|A\|_2, \quad A \in \mathbb{M}_m(\mathbb{C}). \tag{6.2}$$

A Hermitian random matrix $X$ satisfying the above conditions will be called a subgaussian
matrix. Moreover, if $X$ also satisfies the condition

$$\|A\|^2_{L_2(\Pi)} = \mathbb{E}|\langle A, X\rangle|^2 = \|A\|_2^2, \quad A \in \mathbb{M}_m(\mathbb{C}), \tag{6.3}$$

then it will be called an isotropic subgaussian matrix. As was already mentioned in
the Introduction, the last class of matrices includes such examples as Gaussian and
Rademacher design matrices. It easily follows from the basic properties of Orlicz norms
(see, e.g., van der Vaart and Wellner (1996), p. 95) that for subgaussian matrices

$$\|A\|_{L_p(\Pi)} := \bigl\|\langle A, X\rangle\bigr\|_{L_p(\Pi)} \le c_p b_1 b_2\|A\|_2 \quad\text{and}\quad \|A\|_{\psi_1} := \bigl\|\langle A, X\rangle\bigr\|_{\psi_1} \le c\,b_1 b_2\|A\|_2, \quad A \in \mathbb{M}_m(\mathbb{C}),\ p \ge 1,$$

with some numerical constants $c_p > 0$ and $c > 0$.
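As a small illustration of the isotropy condition (6.3), the sketch below checks it by exhaustive enumeration for a toy $2 \times 2$ design. The construction is an assumption made for this example only and is not necessarily the Rademacher design defined in the Introduction: the diagonal entries are independent random signs, the off-diagonal entry is a random sign divided by $\sqrt{2}$, and the test matrix $A$ is restricted to be real symmetric.

```python
import math
from itertools import product


def symmetric_rademacher_2x2(signs):
    """Hypothetical 2x2 symmetric design: +-1 on the diagonal,
    +-1/sqrt(2) off the diagonal, built from three random signs."""
    e11, e22, e12 = signs
    off = e12 / math.sqrt(2.0)
    return [[e11, off], [off, e22]]


def inner(a, x):
    """<A, X> = tr(A X) for real symmetric 2x2 matrices."""
    return sum(a[r][c] * x[c][r] for r in range(2) for c in range(2))


def second_moment_of_inner(a):
    """E <A, X>^2, averaged over all 8 equally likely sign patterns."""
    values = [inner(a, symmetric_rademacher_2x2(s)) ** 2
              for s in product((-1, 1), repeat=3)]
    return sum(values) / len(values)


def hs_norm_sq(a):
    """Squared Hilbert-Schmidt norm of A."""
    return sum(v * v for row in a for v in row)
```

For any real symmetric `a`, `second_moment_of_inner(a)` coincides with `hs_norm_sq(a)` up to rounding, because the cross terms average to zero and the off-diagonal entry of `a` is counted twice with weight $1/2$.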



The following is a version of a well known fact (see, e.g., Rudelson and Vershynin 
(2010), Proposition 2.4). 
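Proposition 7 Let $X$ be a subgaussian $m \times m$ matrix. Then, there exists a constant
$B > 0$ such that

$$\bigl\|\,\|X\|\,\bigr\|_{\psi_2} \le B\sqrt{m}.$$

Proof. Let $M \subset S^{m-1} := \{u \in \mathbb{C}^m : |u| = 1\}$ be an $\varepsilon$-net of the unit sphere in $\mathbb{C}^m$
of the smallest cardinality. It is easy to see that $\operatorname{card}(M) \le (1 + 2/\varepsilon)^m$ and

$$\|X\| = \sup_{u, v \in S^{m-1}}\langle Xu, v\rangle \le (1 - \varepsilon)^{-2}\max_{u, v \in M}\langle Xu, v\rangle.$$

Take $\varepsilon = 1/2$. Using standard bounds for Orlicz norms of a maximum (see, e.g., van der
Vaart and Wellner (1996), Lemma 2.2.2), we get that, with some constants $C_1, C_2, B > 0$,

$$\bigl\|\,\|X\|\,\bigr\|_{\psi_2} \le 4\Bigl\|\max_{u, v \in M}\langle Xu, v\rangle\Bigr\|_{\psi_2} \le C_1\psi_2^{-1}\bigl(\operatorname{card}^2(M)\bigr)\max_{u, v \in M}\bigl\|\langle Xu, v\rangle\bigr\|_{\psi_2}$$

$$\le C_2\sqrt{\log\operatorname{card}(M)}\max_{u, v \in M}\bigl\|\langle X, u \otimes v\rangle\bigr\|_{\psi_2} \le C_2 b_1 b_2\sqrt{\log\operatorname{card}(M)}\max_{u, v \in M}\|u \otimes v\|_2 \le B\sqrt{m}.$$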
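To see where the factor $\sqrt{m}$ comes from, the constants in this chain can be tracked explicitly. This is a sketch under two assumptions that are not stated in the proof itself: $\psi_2(x) = e^{x^2} - 1$, so that $\psi_2^{-1}(N) = \sqrt{\log(1 + N)}$, and $K$ denotes the constant of the maximal inequality being used.

```latex
% With \varepsilon = 1/2:
(1 - \varepsilon)^{-2} = 4, \qquad \operatorname{card}(M) \le (1 + 2/\varepsilon)^m = 5^m .

% Number of pairs (u, v) in M x M is at most 25^m, and 1 + 25^m \le 26^m for m \ge 1:
\psi_2^{-1}\bigl(\operatorname{card}^2(M)\bigr) \le \sqrt{\log(1 + 25^m)} \le \sqrt{m \log 26} .

% By (6.1) and (6.2), \|\langle X, u \otimes v\rangle\|_{\psi_2} \le b_1 b_2 \|u \otimes v\|_2 = b_1 b_2
% for unit vectors u, v, so altogether
\bigl\|\,\|X\|\,\bigr\|_{\psi_2} \le 4K\sqrt{m \log 26}\; b_1 b_2 ,
\qquad\text{i.e. one may take } B = 4K b_1 b_2 \sqrt{\log 26} .
```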



Below, we give oracle inequalities and random error bounds in the subgaussian
design case. We will use the following notations. Given $t > 0$, let

$$t_m := t + \log(2m), \quad \tau_n := t + \log\log_2(2n), \quad\text{and}\quad t_{n,m} := \tau_n\log n \vee t_m.$$

Also, denote c 5 := ||C||v> 2 lo S ^f 1 and let 



/ fnt m \ 1 



mtr 



n 



(clearly, we assume here that the noise has a bounded $\psi_2$-norm).

Theorem 7 There exist constants $C > 0$, $c > 0$ such that the following holds. For all
$t > 0$ and $\lambda > 0$ such that $\tau_n \le c\lambda^2 n$, for all $S \in \mathcal{S}$ and for all $\varepsilon \in [0, 1]$, with probability
at least $1 - e^{-t}$,



l|p £ -S||| 2(n) <(l + A)|| 1 S-p||| 2(n) + C7 
— V(c 5 V^rn) 



£ (||.o gS || A .o g =)V.^V 



n 



and 



||/5 £ -/9||| 2(n) <(l + A)||5-p||| 2(n) + C 

mt m\l ( ,. /—.y/mtns 



e II iogSl| a lot 



m 



(6.4) 

V 

(6.5) 
In particular,



\P £ - P\\l 2{n) < C 



m) 



mt r 



11 



We now turn to more subtle oracle inequalities that take into account low rank
properties of oracles $S \in \mathcal{S}$.

Theorem 8 There exist numerical constants $C > 0$, $D > 0$, $c > 0$ such that the following
holds. For all $t > 0$ and $\lambda > 0$ such that $\tau_n \le c\lambda^2 n$, for all $\varepsilon \ge D\varepsilon_{n,m}$, for all subspaces
$L \subset \mathbb{C}^m$ with $\dim(L) := r$ and for all $S \in \mathcal{S}$, with probability at least $1 - e^{-t}$,



\p £ -s\ 



i 2 (n) + 4^ £ ^)<(l + A)l|5-p||I 2( n) + 



C 



a\lo g S)e*\Jo-p\L) 



mr + T n 
n 



\fat\\P L ±SP L 



-L 111 



■\/( c « V 



m) 



(6.6) 

Tntn,m 



n 



and 



\P £ 



p|lL(n)<(l + A)||5-p||i 2{ri) + ^ 



J (log5)e 2 \/-I 



mr + r n 



■n 



V 



^||fLxSfLx||l^V(«£ V ^) V "" 



■;?. 



n 



(6.7) 



Similarly to the previous section, we also derive bounds on the random error $\|\hat\rho^\varepsilon - \rho^\varepsilon\|^2_{L_2(\Pi)}$.

Theorem 9 There exist numerical constants $C > 0$, $D > 0$, $c > 0$ such that the following
holds. Under the assumption that $\tau_n \le cn$, for all $t > 0$, for all $\varepsilon \ge D\varepsilon_{n,m}$ and for all
subspaces $L \subset \mathbb{C}^m$ with $\dim(L) := r$, with probability at least $1 - e^{-t}$,



\p £ -p £ \\ 2 L2{u) +zK(?;pC)<C 



P\\L 2 (U) 



mt T 



n 



\/( c ? V 



771 



mr + T n 



n 



y<Tt\\p L ±ffp L ±h 



mt r 



n 



n 



V 

(6.8) 



We will give only the proof of Theorem 8.

Proof. It follows the lines of the proof of Theorem [5] very closely. The main changes 
are in the bounds of steps 1-3 of this proof that have to be modified in the subgaussian 
design case. The rest of the proof is straightforward. 
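In Step 1, we have to bound the following quantity:

$$\frac{1}{n}\sum_{j=1}^n\Bigl(\langle\hat\rho^\varepsilon - S, X_j\rangle^2 - \mathbb{E}\langle\hat\rho^\varepsilon - S, X\rangle^2\Bigr).$$

To this end, we will study the empirical process

$$\Delta_n(\delta) := \sup_{f \in \mathcal{F}_\delta}\Bigl|\frac{1}{n}\sum_{j=1}^n\bigl(f^2(X_j) - Pf^2\bigr)\Bigr|,$$

where $\mathcal{F}_\delta := \{\langle S_1 - S_2, \cdot\rangle : S_1, S_2 \in \mathcal{S},\ \|S_1 - S_2\|_{L_2(\Pi)} \le \delta\}$. Clearly,

$$\Bigl|\frac{1}{n}\sum_{j=1}^n\Bigl(\langle\hat\rho^\varepsilon - S, X_j\rangle^2 - \mathbb{E}\langle\hat\rho^\varepsilon - S, X\rangle^2\Bigr)\Bigr| \le \Delta_n\bigl(\|\hat\rho^\varepsilon - S\|_{L_2(\Pi)}\bigr).$$

Our goal is to obtain an upper bound on $\Delta_n(\delta)$ uniformly in $\delta \in [(m/n)^{1/2}, 2b_2]$. First we
use a version of Talagrand's concentration inequality for empirical processes indexed by
unbounded functions due to Adamczak (see subsection 3.2). It implies that with some
constant $C > 0$ and with probability at least $1 - e^{-t}$

$$\Delta_n(\delta) \le 2\mathbb{E}\Delta_n(\delta) + C\delta^2\sqrt{\frac{t}{n}} + C\frac{mt\log n}{n}. \tag{6.9}$$
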

Here we used the following bounds on the uniform variance and on the envelope of the
function class $\mathcal{F}_\delta^2$: for the uniform variance, with some constant $c > 0$,

$$\sup_{f \in \mathcal{F}_\delta}(Pf^4)^{1/2} = \sup_{S_1, S_2 \in \mathcal{S},\ \|S_1 - S_2\|_{L_2(\Pi)} \le \delta}\mathbb{E}^{1/2}\langle S_1 - S_2, X\rangle^4 = \sup_{S_1, S_2 \in \mathcal{S},\ \|S_1 - S_2\|_{L_2(\Pi)} \le \delta}\|S_1 - S_2\|^2_{L_4(\Pi)} \le c\delta^2$$

by the equivalence properties of the norms in Orlicz spaces. For the envelope,

$$\sup_{f \in \mathcal{F}_\delta}f^2(X) = \sup_{S_1, S_2 \in \mathcal{S},\ \|S_1 - S_2\|_{L_2(\Pi)} \le \delta}\langle S_1 - S_2, X\rangle^2 \le 4\|X\|^2$$

and

$$\Bigl\|\max_{1 \le i \le n}\sup_{f \in \mathcal{F}_\delta}f^2(X_i)\Bigr\|_{\psi_1} \le c_1\bigl\|\,\|X\|^2\,\bigr\|_{\psi_1}\log n \le c_2\bigl\|\,\|X\|\,\bigr\|^2_{\psi_2}\log n \le c_3 m\log n,$$

for some constants $c_1, c_2, c_3 > 0$, where we used well known inequalities for maxima of
random variables in Orlicz spaces (see, e.g., Lemma 2.2.2 in van der Vaart and Wellner
(1996)).
To bound the expectation $\mathbb{E}\Delta_n(\delta)$ we use a recent result by Mendelson (2010) (see
subsection 3.2; in fact, an even earlier result by Klartag and Mendelson (2005) with the
$\psi_2$-diameter instead of the $\psi_1$-diameter would suffice for our purposes). It gives

$$\mathbb{E}\Delta_n(\delta) \le c\Bigl[\sup_{f \in \mathcal{F}_\delta}\|f\|_{\psi_1}\frac{\gamma_2(\mathcal{F}_\delta; \psi_2)}{\sqrt{n}} \vee \frac{\gamma_2^2(\mathcal{F}_\delta; \psi_2)}{n}\Bigr] \tag{6.10}$$

with some constant $c > 0$. It follows from (6.1) that the $\psi_1$- and $\psi_2$-norms of functions
from the class $\mathcal{F}_\delta$ can be bounded from above by a constant times the $L_2(P)$-norm. As
a result,

$$\sup_{f \in \mathcal{F}_\delta}\|f\|_{\psi_1} \le c\delta \tag{6.11}$$

and the following bound holds for Talagrand's generic chaining complexities:

$$\gamma_2(\mathcal{F}_\delta; \psi_2) \le \gamma_2(\mathcal{F}_\delta; c\|\cdot\|_{L_2(\Pi)}), \tag{6.12}$$

where $c$ is a constant. Let $G$ be a symmetric real valued random matrix with independent
centered Gaussian entries $\{g_{ij}\}$ on the diagonal and above, where $\mathbb{E}g_{ii}^2 = 1$ and $\mathbb{E}g_{ij}^2 = \frac{1}{2}$, $i < j$.
Then, using condition (6.2), we have that, for some constant $c_1 > 0$,



E\{S U G) - (S 2 ,G)(< 



\S\ — S2H2 > Ci 1 1 aSi — S 1 ; 



2llz, 2 (n)' 



and it easily follows from Talagrand's generic chaining bound that, for some constant 
C > 0, 



72(^5; 4 ■ || L2(n) ) < CE sup \(S 1 -S2,G)\=:Cu;(G;6). 



It follows from (|6T0jh (ffiTTTjh (|6T2jl and (1031) that 

,cj(G;6) v ,oj 2 {G;5) 



3.13) 



EA n (5) < C 



n 



V 



n 



(6.14) 



To bound Esup s Sa65 1 



i 2 ( n ) 



<<5 



{Si — S2, G) 



\{S\ — S2,G}\, note that 



< l|5i- S 2 \\ ill Gil < 2||G||, 



and, by Proposition 
uj(G;S) =E 



sup 

Pi,P2G5,||pi-p 2 ||i, 2 (n)<'5 



(S 1 -S 2 ,G) < 2E||G|| < c^E. 
Substituting this bound in (6.14) yields that, for some constant $C > 0$,

$$\mathbb{E}\Delta_n(\delta) \le C\Bigl[\delta\sqrt{\frac{m}{n}} \vee \frac{m}{n}\Bigr], \tag{6.15}$$

and combining (6.15) with (6.9) gives that with probability at least $1 - e^{-t}$

$$\Delta_n(\delta) \le C\Bigl[\delta\sqrt{\frac{m}{n}} \vee \frac{m}{n} \vee \delta^2\sqrt{\frac{t}{n}} \vee \frac{mt\log n}{n}\Bigr]. \tag{6.16}$$




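It is easy to make bound (6.16) uniform in $\delta \in [(m/n)^{1/2}, 2b_2]$ by a simple discretization
argument (as we did in Step 3 of the proof of Theorem 5). This leads to the following
result: with probability at least $1 - e^{-t}$, for all $\delta \in [(m/n)^{1/2}, 2b_2]$,

$$\Delta_n(\delta) \le C\Bigl[\delta\sqrt{\frac{m}{n}} \vee \frac{m}{n} \vee \delta^2\sqrt{\frac{\tau_n}{n}} \vee \frac{m\tau_n\log n}{n}\Bigr], \tag{6.17}$$

where $\tau_n = t + \log\log_2(2n)$. Thus, with the same probability and with a proper choice
of constant $C > 0$,

$$\Bigl|\frac{1}{n}\sum_{j=1}^n\Bigl(\langle\hat\rho^\varepsilon - S, X_j\rangle^2 - \mathbb{E}\langle\hat\rho^\varepsilon - S, X\rangle^2\Bigr)\Bigr| \le C\Bigl[\|\hat\rho^\varepsilon - S\|_{L_2(\Pi)}\sqrt{\frac{m}{n}} \vee \|\hat\rho^\varepsilon - S\|^2_{L_2(\Pi)}\sqrt{\frac{\tau_n}{n}} \vee \frac{m\tau_n\log n}{n}\Bigr],$$

provided that $\|\hat\rho^\varepsilon - S\|_{L_2(\Pi)} \in [(m/n)^{1/2}, 2b_2]$.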
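One common way to carry out such a discretization (a sketch of the general technique, not necessarily the exact grid used in the proof) is to apply the fixed-$\delta$ bound on a dyadic grid covering the interval and to pay for the union bound by enlarging $t$ by the logarithm of the number of grid points. The function names below are hypothetical.

```python
import math


def dyadic_grid(lo, hi):
    """Points hi, hi/2, hi/4, ... down to the first one that is <= lo.
    Every delta in [lo, hi] lies within a factor 2 below some grid point."""
    points = [hi]
    while points[-1] > lo:
        points.append(points[-1] / 2.0)
    return points


def union_bound_level(t, num_points):
    """Confidence parameter to use at each grid point so that the bound
    holds at all points simultaneously with probability >= 1 - exp(-t)."""
    return t + math.log(num_points)
```

For instance, with `m = 10`, `n = 10000` and `b2 = 1.0`, the call `dyadic_grid(math.sqrt(m / n), 2 * b2)` returns 7 points, and each fixed-$\delta$ bound is then applied with `union_bound_level(t, 7)` in place of `t`. Since the number of points grows only logarithmically in $n$, the correction is of double-logarithmic order, which is the role played by the term $\log\log_2(2n)$ in $\tau_n$.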

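Similarly to Step 2 of the proof of Theorem 5, we have to bound the expression

$$\frac{1}{n}\sum_{j=1}^n\Bigl(\langle S - \rho, X_j\rangle\langle\hat\rho^\varepsilon - S, X_j\rangle - \mathbb{E}\langle S - \rho, X\rangle\langle\hat\rho^\varepsilon - S, X\rangle\Bigr).$$

We use the bound

$$\Bigl|\frac{1}{n}\sum_{j=1}^n\Bigl(\langle S - \rho, X_j\rangle\langle\hat\rho^\varepsilon - S, X_j\rangle - \mathbb{E}\langle S - \rho, X\rangle\langle\hat\rho^\varepsilon - S, X\rangle\Bigr)\Bigr| \le \|\hat\rho^\varepsilon - S\|_1\Bigl\|\frac{1}{n}\sum_{j=1}^n\bigl(\langle S - \rho, X_j\rangle X_j - \mathbb{E}\langle S - \rho, X\rangle X\bigr)\Bigr\|$$

and Proposition 2 with $\alpha = 1$. Note that

$$\|\mathbb{E}\langle S - \rho, X\rangle^2X^2\| \le \mathbb{E}\langle S - \rho, X\rangle^2\|X\|^2 \le \mathbb{E}^{1/2}\langle S - \rho, X\rangle^4\,\mathbb{E}^{1/2}\|X\|^4 \le cm\|S - \rho\|^2_{L_2(\Pi)}$$

with a constant $c > 0$. Also,

$$\bigl\|\,\|\langle S - \rho, X\rangle X\|\,\bigr\|_{\psi_1} = \bigl\|\,|\langle S - \rho, X\rangle|\,\|X\|\,\bigr\|_{\psi_1} \le c_1\bigl\|\langle S - \rho, X\rangle\bigr\|_{\psi_2}\bigl\|\,\|X\|\,\bigr\|_{\psi_2} \le c_2\sqrt{m}\|S - \rho\|_{L_2(\Pi)}$$

with some constants $c_1, c_2 > 0$. Finally, note that

$$\|S - \rho\|^2_{L_2(\Pi)} \le b_2^2\|S - \rho\|_2^2 \le b_2^2\|S - \rho\|_1\|S - \rho\| \le 4b_2^2,$$

since, for $S, \rho \in \mathcal{S}$, $\|S - \rho\|_1 \le 2$ and $\|S - \rho\| \le 2$. Using the fact that $\|\hat\rho^\varepsilon - S\|_1 \le 2$,
Proposition 2 and the previous bounds imply that with probability at least $1 - e^{-t}$ and
with some constants $C_1, C_2, C > 0$,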


- E ( ( S ~ Pi X ^(P E ~ 5 ' X i) ~ E < 5 ~ P> X > (P £ ~ S ' X ^) 
n 1=1 V 7 

- E ( ( S - P> X i) X i ~ E < 5 " P> X )A 
.7=1 V J 



< 



Ci 



W-sh 



n 



n 



m\\S- p\\l 2 (u) 



< 



C 



\\s-p\\ L ( n) \/ — + log( - 2m ^ y + lQ g( 2m )) 



n " n 

We now modify the bounds of Step 3 of the proof of Theorem 5. We need to bound
the following expression:

n 1 n , n 



3=1 



3=1 



3=1 



< ll^(/3 £ -<S)P L x||i 



As in the proof of Theorem 5,

P L x (p £ - S)P r x , - ZjPi+XjPL 

n 3=1 1 " 3=1 

By Proposition 2, it is easy to show that with probability at least $1 - e^{-t}$,

1 " 

-E^-w 



< 



C 



V 1 1 



i>2 



log 



lltll* 11*11 



f 2 \ t + log(2m) 
We replace $\sigma_X$, $\|\mathbb{E}X^2\|^{1/2}$ and $\bigl\|\,\|X\|\,\bigr\|_{\psi_2}$ by an upper bound $c\sqrt{m}$ (see Proposition 7),
which yields a simplified inequality



1 - 



< C 



m(t + log(2m)) 



n 



\JUUz log 



11(11 



i>2 



m(t + log(2m)) 



Hence, with probability at least $1 - e^{-t}$,

i i _ZL 

P L x (/f - 5)P L x , - ^ SjPLxXjPv 



0=1 



< 



c\\p L ±(jf - S)P L 



-L 1 



m(i + log(2m)) 



n 



11(11 



■02 



m(i + log(2m)) 



n 



The remaining term $\frac{1}{n}\sum_{j=1}^n\xi_j\langle\hat\rho^\varepsilon - S, \mathcal{P}_LX_j\rangle$ is bounded exactly as in Step 3 of the proof
of Theorem 5 with the use of Adamczak's (2008) version of Talagrand's concentration
inequality. This leads to the following bound: with probability at least $1 - e^{-t}$,



I n 

-Y,^{p e s,P L x j 



< C 



( r^(L)||p £ - < S|U 2(n)A / ! ^ : +^IIP e -5||L 2 (n) A /^+||(IU 



n 



mT TI lo 



n 



where $\tau_n = t + \log\log_2(2n)$.



For simplicity, we state the next corollaries (similar to Corollaries 1 and 2) only in
the case of subgaussian isotropic design. Recall that in this case $\|\cdot\|_{L_2(\Pi)} = \|\cdot\|_2$ and
$\beta(L) = 1$.

Corollary 5 There exist numerical constants $C > 0$, $c > 0$ such that the following holds.
For all $t > 0$ and $\lambda > 0$ such that $\tau_n \le c\lambda^2 n$, for all sufficiently large $D > 0$ and for
$\varepsilon = D\varepsilon_{n,m}$, for all matrices $S \in \mathcal{S}$ of rank $r$, with probability at least $1 - e^{-t}$,

a 



|p £ -p||| 2( n)<(l + A)||5-p||i 2(n) + ^ 



2 / 2 rrn ^m 
JJ | <7£ 



af^\J(c^V^) 



mt r 



n 



(6.18) 



Corollary 6 There exist numerical constants $C > 0$, $c > 0$ such that the following
holds. For all $t > 0$ and for all $\lambda > 0$ such that $\tau_n \le c\lambda^2 n$, for all sufficiently large $D$
and for $\varepsilon = D\varepsilon_{n,m}$, for all Hermitian matrices $H$ and for all $r \le m$, with probability at
least $1 - e^{-t}$,



W ~ P\\l m < (1 + \)\\PH ~ P\\\ m + J 



5l(H)\l D 2 [ap^\l 



2 T r mtl l \\ / 2 mr + T n\l ( ,. /—s^/mtr 



(6.19) 



In the special case of Gaussian noise, the bounds of the above corollaries can be simplified
since in this case $c_\xi \le c\sigma_\xi$ for some numerical constant $c$. In particular, Corollary
5 immediately implies the bound of Theorem 2 in the Introduction. Both bounds of
Theorem 1 follow from Theorems 7 and